Assistance during audio and video calls

ABSTRACT

Implementations relate to providing information items for display during a communication session. In some implementations, a computer-implemented method includes receiving, during a communication session between a first computing device and a second computing device, first media content from the communication session. The method further includes determining a first information item for display in the communication session based at least in part on the first media content. The method further includes sending a first command to at least one of the first computing device and the second computing device to display the first information item.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a continuation of U.S. Pat. Application No. 17/031,416, filed Sep. 24, 2020 and titled ASSISTANCE DURING AUDIO AND VIDEO CALLS, which is a continuation of U.S. Pat. Application No. 15/953,266, filed Apr. 13, 2018 and titled ASSISTANCE DURING AUDIO AND VIDEO CALLS (now U.S. Pat. No. 10,791,078), which claims priority to U.S. Provisional Pat. Application No. 62/538,764, filed Jul. 30, 2017 and titled ASSISTANT DURING AUDIO AND VIDEO CALLS, the contents of all of which are incorporated herein by reference in their entirety.

BACKGROUND

Communication sessions using computing devices, e.g., one-to-one audio and video calls, audio conferences, video conferences, text messaging, etc. are popular. Users engage in communication sessions for a variety of purposes, e.g., to communicate with friends and family, to conduct business meetings, to share images, audio, video, computer files, etc. Communication sessions using computing devices enable users located in different geographic locations to easily communicate with each other.

The background description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent it is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present disclosure.

SUMMARY

Some implementations can include a computer-implemented method. The method can include receiving, during a communication session between a first computing device and a second computing device, first media content from the communication session. The method can also include determining, based at least in part on the first media content, a first information item for display in the communication session. The method can further include sending a first command to at least one of the first computing device and the second computing device to display the first information item.
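As a rough illustration of the flow summarized above, the following sketch receives media content from a session, derives a context, looks up a matching information item, and sends a display command. It is not the claimed implementation; the session object, knowledge source, and all names are illustrative assumptions.

```python
# Minimal sketch (not the claimed implementation): receive media content,
# determine an information item, and command its display. All interfaces
# (session, knowledge_source) are illustrative placeholders.
from dataclasses import dataclass


@dataclass
class DisplayCommand:
    item: dict            # the information item to render (text, image URL, etc.)
    target_devices: list  # device identifiers that should display the item


def assist(session, media_content, knowledge_source):
    """Determine an information item from media content and command its display."""
    # Derive a coarse context, e.g., keywords from a speech transcript.
    context = knowledge_source.extract_context(media_content)
    # Pick an information item that matches the context, if any.
    item = knowledge_source.lookup(context)
    if item is None:
        return None
    command = DisplayCommand(item=item, target_devices=session.device_ids)
    session.send(command)  # deliver the command to one or both devices
    return command
```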

Receiving the first media content from the communication session can include receiving respective media content from the first computing device and from the second computing device. The method can also include receiving second media content from the communication session, where the second media content is generated in the communication session subsequent to the first media content, and, based at least in part on the second media content, determining a second information item for the communication session. The method can further include sending a second command to the first computing device and the second computing device to display the second information item.

The communication session can include a synchronous communication session, and receiving the first media content can include receiving at least one of: audio, video, and text from the communication session. The method can also include sending the first information item to at least one of the first computing device and the second computing device. Determining the first information item can be based on contextual information.

The method can further include sending a request for the contextual information to at least one of the first computing device and the second computing device, and receiving the contextual information. The first command can be configured to cause display of a user interface that includes the first information item. The communication session can include video that includes a face, and the first command can cause display of the user interface such that the face is not obscured.

The first information item can be associated with a first user of the first computing device, and the first command can be sent to the first computing device and not sent to the second computing device. The first command can be configured to cause display of a user interface that includes a selectable user interface element that enables the first user to provide permission to share the first information item in the communication session. The method can also include receiving, from the first computing device, an indication that the first user has provided the permission to share the first information item, and, in response to receiving the indication, sending a third command to the second computing device to display the first information item.

The method can be implemented by a third computing device distinct from the first computing device and the second computing device, and the communication session can include audio and video exchanged between the first computing device and the second computing device. Receiving the first media content can include receiving the audio.

The method can be implemented by an assistant application on the first computing device, and the method can further include providing a visual indicator that the assistant application is active. The method can further include receiving, during the communication session, a user command to disable the assistant application, and, in response to the user command, disabling the assistant application. The assistant application can be part of an application program executing on the first computing device that provides the communication session.

The method can also include sending a permission command to the first computing device and the second computing device, the permission command configured to cause display of a permission user interface that enables a first user of the first computing device and a second user of the second computing device to provide respective permission indications, prior to receiving the first media content from the communication session. The method can further include receiving the respective permission indications, and determining whether each of the respective permission indications includes a user permission to receive the first media content. Receiving the first media content from the communication session may not be performed if at least one of the respective permission indications does not include the user permission.

The method can also include detecting whether the first media content includes an invocation phrase. The determining and the sending can be performed if it is determined that the first media content includes the invocation phrase.

Receiving the first media content can include receiving a locally-generated representation of user activity within the communication session from the first computing device and the second computing device. The locally-generated representation can be based on at least one of: audio, video, and text transmitted by the respective computing device during the communication session.

The communication session can include video that includes a face, and the first command can cause display of a user interface at a particular position relative to the face. The method can be implemented by a third computing device that participates in the communication session, the third computing device distinct from the first computing device and the second computing device. The method can also include receiving, during the communication session, a user command to disconnect from the communication session, and in response to the user command, removing the third computing device from the communication session.

The first information item can include at least one of audio and video, and the first command can be configured to cause playback of at least one of the audio and the video. The first information item can correspond to an interactive application, and the first command can be configured to cause display of an interactive user interface of the interactive application. First media content can include audio from the communication session, and the first information item can include at least one of: a text transcript of the audio, and a translation of the audio. First media content can include video from the communication session, and the first information item can be determined based on recognizing an object in the video.

Some implementations can include a non-transitory computer readable medium with instructions stored thereon that, when executed by a hardware processor, cause the hardware processor to perform operations. The operations can include receiving, during a communication session between a first computing device and a second computing device, first media content from the communication session; for example, from the first computing device. The operations can further include receiving, during the communication session, second media content from the second computing device, and determining a first information item for display in the communication session, based at least in part on the first media content and the second media content. The operations can also include sending a first command to the first computing device and the second computing device to display the first information item.

The second media content can be received subsequent to the first media content. The first command can be configured to cause display of a user interface that includes the first information item on the first computing device and the second computing device.

Some implementations can include a system comprising a hardware processor, and a memory coupled to the hardware processor with instructions stored thereon that, when executed by the hardware processor, cause the hardware processor to perform operations.

The operations can include receiving, during a communication session between a first computing device and a second computing device, first media content from the communication session. For example, receiving the first media content from the communication session comprises receiving media content from the first computing device and second media content from the second computing device. The second media content can be received subsequent to the first media content. The operations can also include determining a first information item for display in the communication session, based at least in part on the first media content and the second media content. The operations can further include sending a first command to the first computing device and the second computing device to display the first information item.

Determining the first information item can be based on contextual information. The instructions can cause the hardware processor to perform further operations including sending a request for the contextual information to at least one of the first computing device and the second computing device, and receiving the contextual information from the at least one of the first computing device and the second computing device.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a block diagram of an example network environment which may be used for one or more implementations described herein;

FIG. 2A is a block diagram illustrating an example configuration 200 in which a communication session with assistance may be provided, according to some implementations;

FIG. 2B is a block diagram illustrating another example configuration 220 in which a communication session with assistance may be provided, according to some implementations;

FIG. 2C is a block diagram illustrating another example configuration 230 in which a communication session with assistance may be provided, according to some implementations;

FIG. 3 is a flow diagram illustrating an example method 300 to provide information items during a communication session, according to some implementations;

FIG. 4 is a flow diagram illustrating an example method 400 to determine information items during a communication session, according to some implementations;

FIG. 5A is a diagrammatic illustration of an example user interface 500, according to some implementations;

FIG. 5B is a diagrammatic illustration of an example user interface 520, according to some implementations;

FIG. 5C is a diagrammatic illustration of an example user interface 540, according to some implementations;

FIG. 6A is a diagrammatic illustration of an example user interface 600, according to some implementations;

FIG. 6B is a diagrammatic illustration of an example user interface 620, according to some implementations;

FIG. 6C is a diagrammatic illustration of an example user interface 640, according to some implementations;

FIG. 6D is a diagrammatic illustration of an example user interface 660, according to some implementations;

FIG. 7 is a diagrammatic illustration of an example user interface 700, according to some implementations;

FIG. 8 is a diagrammatic illustration of an example user interface 800, according to some implementations; and

FIG. 9 is a block diagram of an example device which may be used for one or more implementations described herein.

DETAILED DESCRIPTION

Implementations of the subject matter in this application relate to providing assistance during a computer-mediated communication session conducted between participant users. Providing assistance may include providing information items that are suitable in the context of a conversation between participants in the communication session. For example, information items may include photos, audio, video, computer applications, etc. that are determined based on context of the conversation.

In some implementations, user permission is obtained for an assistant application to determine context based on media content during the communication session, e.g., audio, video, and/or text exchanged between participant users. In some implementations, the assistant application may determine context based on speech or text input provided by participants in the session, gesture input provided by participants in the session, etc. In some implementations, the assistant application may be invoked by a participant, e.g., by uttering an invocation phrase.
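As a rough illustration of the invocation behavior described above, the sketch below gates assistance on an invocation phrase found in a speech transcript. The specific phrases, the transcript source, and the handler names are assumptions for illustration only, not details from the disclosure.

```python
# Illustrative sketch only: gate assistance on an invocation phrase detected
# in a transcript of session audio. The phrases and helper names are assumed.
INVOCATION_PHRASES = ("ok assistant", "hey assistant")


def contains_invocation(transcript_segment: str) -> bool:
    """Return True if the segment contains an invocation phrase."""
    text = transcript_segment.lower()
    return any(phrase in text for phrase in INVOCATION_PHRASES)


def on_transcript_segment(segment: str, assistant) -> None:
    # Only activate the assistant once a participant has invoked it.
    if contains_invocation(segment):
        assistant.activate()
```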

In some implementations, the assistant application may access public information sources, e.g., on the Internet, sources shared between participants, and if permitted by each participant user, respective user data of the participant users. The assistant application may retrieve information items, e.g., photos, documents, maps, recipes, restaurant information, sports scores, schedule information, etc. that match the conversation context from the information sources. The assistant application provides the information items to participants in the communication session. In implementations where the assistant application provides information items that are not shared with other participant users in the communication session, the assistant application obtains user permission prior to providing such information items in the communication session.
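One way the context-to-source matching described above could be sketched is as a simple keyword overlap over sources the participants have permitted; the scoring, source objects, and field names are assumptions chosen only to illustrate the idea.

```python
# Illustrative sketch only: match a conversation context against information
# sources the participants have permitted. The naive keyword-overlap scoring
# and the source/item fields are assumed placeholders.
def retrieve_matching_items(context_keywords, sources, max_items=3):
    """Return information items whose tags overlap the conversation context."""
    scored = []
    for source in sources:
        if not source.permitted:          # skip sources a user has not permitted
            continue
        for item in source.items:
            overlap = len(set(item.tags) & set(context_keywords))
            if overlap:
                scored.append((overlap, item))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:max_items]]
```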

In some implementations, the assistant application may be turned on or off based on a user command. In some implementations, the assistant application may provide information items that are based on media content exchanged between participants during the communication session, e.g., a text transcript of audio exchanged during the communication session, a translation of speech exchanged during the communication session, etc. In some implementations, the assistant application may provide a participant user in a communication session with a user interface that enables the participant user to view different conversation contexts and corresponding information items provided by the assistant application in a stacked or chronological manner, during and after the communication session.

In situations in which certain implementations discussed herein may collect or use personal information about users (e.g., user data, information about a user’s social network, user’s location and time at the location, user’s biometric information, user’s activities and demographic information), users are provided with one or more opportunities to control whether information is collected, whether the personal information is stored, whether the personal information is used, and how the information is collected about the user, stored and used. That is, the systems and methods discussed herein collect, store and/or use user personal information specifically upon receiving explicit authorization from the relevant users to do so. For example, a user is provided with control over whether programs or features collect user information about that particular user or other users relevant to the program or feature. Each user for which personal information is to be collected is presented with one or more options to allow control over the information collection relevant to that user, to provide permission or authorization as to whether the information is collected and as to which portions of the information are to be collected. For example, users can be provided with one or more such control options over a communication network. In addition, certain data may be treated in one or more ways before it is stored or used so that personally identifiable information is removed. As one example, a user’s identity may be treated so that no personally identifiable information can be determined. As another example, a user’s geographic location may be generalized to a larger region so that the user’s particular location cannot be determined.
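The location generalization mentioned at the end of the paragraph above could look roughly like the following; the grid resolution and rounding scheme are assumed parameters chosen for illustration, not values from the disclosure.

```python
# Illustrative sketch only: generalize a precise latitude/longitude to a coarse
# grid cell so that a particular location cannot be recovered. The 0.1-degree
# cell size (roughly city scale) is an assumed parameter.
def generalize_location(lat: float, lon: float, cell_degrees: float = 0.1):
    """Snap coordinates to the center of a coarse grid cell."""
    def snap(value: float) -> float:
        cell_index = int(value // cell_degrees)
        return (cell_index + 0.5) * cell_degrees

    return snap(lat), snap(lon)


# Example: (37.4219, -122.0841) becomes approximately (37.45, -122.05).
```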

FIG. 1 illustrates a block diagram of an example network environment 100, which may be used in some implementations described herein. In some implementations, network environment 100 includes one or more server systems, e.g., server system 102 and server system 140 in the example of FIG. 1. Server systems 102 and 140 can communicate with a network 130, for example. Server system 102 can include a server device 104 and a database 106 or other storage device. Server system 140 can include a server device 142 and a database 146 or other storage device. In some implementations, server device 104 may provide a communication application 152 b. Further, in some implementations, server device 104 may provide an assistant application 158 b and/or server device 142 may provide an assistant application 158 c.

Network environment 100 also can include one or more client devices, e.g., client devices 120, 122, 124, and 126, which may communicate with each other and/or with server system 102 via network 130. In some implementations, client devices 120-126 may communicate with each other directly such that the communications between the client devices are not routed via a server system. In some implementations, client devices 120-126 may communicate with each other via a server system, e.g., server system 102. Network 130 can be any type of communication network, including one or more of the Internet, local area networks (LAN), wireless networks, switch or hub connections, etc. In some implementations, network 130 can include peer-to-peer communication between devices, e.g., using peer-to-peer wireless protocols (e.g., Bluetooth®, Wi-Fi Direct, etc.), etc. One example of peer-to-peer communications between two client devices 120 and 122 is shown by arrow 132.

For ease of illustration, FIG. 1 shows one block for server system 102, server device 104, database 106, server system 140, server device 142, and database 146, and shows four blocks for client devices 120, 122, 124, and 126. Server blocks 102, 104, 106, 140, 142, and 146 may represent multiple systems, server devices, and network databases, and the blocks can be provided in different configurations than shown. In some implementations, server systems 102 and 140 may be controlled and/or operated by different owners or parties. For example, server system 102 may provide a communication application 152 b from a first provider and an assistant application 158 b from the first provider. Server system 140, controlled by a second provider that does not provide a communication application, may provide an assistant application 158 c that can participate in a communication session provided by the communication application 152 b.

For example, server systems 102 and 140 can represent multiple server systems that can communicate with other server systems via the network 130. In some implementations, server systems 102 and 140 can include cloud hosting servers, for example. In some examples, databases 106, 146 and/or other storage devices can be provided in server system block(s) that are separate from server devices 104 and 142, and can communicate with server devices 104, 142, and other server systems via network 130.

Also, there may be any number of client devices. Each client device can be any type of electronic device, e.g., desktop computer, laptop computer, portable or mobile device, cell phone, smart phone, tablet computer, television, TV set top box or entertainment device, home speaker, videoconferencing system, wearable devices (e.g., display glasses or goggles, wristwatch, headset, armband, jewelry, etc.), personal digital assistant (PDA), media player, game device, etc. Some client devices may also have a local database similar to database 106 or other storage. In some implementations, network environment 100 may not have all of the components shown and/or may have other elements including other types of elements instead of, or in addition to, those described herein.

In various implementations, end-users U1, U2, U3, and U4 may communicate with server system 102 and/or each other using respective client devices 120, 122, 124, and 126. In some examples, users U1, U2, U3, and U4 may interact with each other via applications running on respective client devices and/or server system 102, and/or via a network service, e.g., a social network service, a communication application, or other type of network service, implemented on server system 102. For example, respective client devices 120, 122, 124, and 126 may communicate data to and from one or more server systems, e.g., systems 102 and/or 140. In some implementations, the server systems 102 and/or 140 may provide appropriate data to the client devices such that each client device can receive communicated content or shared content uploaded to the server system 102 and/or 140.

In some examples, users U1-U4 can interact via audio or video conferencing, audio, video, or text chat, or other communication modes or applications, e.g., communication applications 152 a and 152 b. A network service implemented by server system 102 can include a system allowing users to perform a variety of communications, form links and associations, upload and post shared content such as images, text, video, audio, and other types of content, and/or perform other functions. For example, a client device can display received data such as content posts sent or streamed to the client device and originating from a different client device via a server and/or network service (or from the different client device directly), or originating from a server system and/or network service. In some implementations, client devices can communicate directly with each other, e.g., using peer-to-peer communications between client devices as described above. In some implementations, a “user” can include one or more programs or virtual entities, as well as persons that interface with the system or network.

In some implementations, any of client devices 120, 122, 124, and/or 126 can provide one or more applications. For example, as shown in FIG. 1, client device 120 may provide communication application 152 a, assistant application 158 a, and one or more other applications 154. Client devices 122-126 may also provide similar applications. For example, communication application 152 a may provide a user of a respective client device (e.g., users U1-U4) with the ability to engage in a communication session with one or more other users. In some implementations, the communication session may be a synchronous communication session in which all participants are present at the same time. In some implementations, the communication session may include audio and/or video from each respective participant such that other participants can see and/or hear the respective participant during the communication session via a display screen and/or an audio speaker of their respective client devices. In some implementations, the communication may include text exchanged between participants, alternatively or in addition to audio and/or video.

In some implementations, one or more client devices may include an assistant application, e.g., the assistant application 158 a. In some implementations, where participants provide consent to use of an assistant application during a communication session, one or more of assistant application 158 a, assistant application 158 b, and assistant application 158 c may analyze media content exchanged by the participants in the communication session to determine one or more information items to be provided to the participants during the communication session.

In some implementations, multiple assistant applications may be active during a communication session, e.g., one or more of assistant applications 158 a, 158 b, and 158 c may be active and provide assistance during a communication session. In some implementations, different assistant applications may be configured to provide assistance in similar or different contexts. For example, one assistant application may include translation functionality, and may be invoked in response to a user request to provide translations of speech in audio content provided by participants in a communication session. In the same communication session, a second assistant application that provides contextual assistance, e.g., retrieves information from a shared repository of documents that are shared between participant users, may be active and provide assistance.

In some implementations, multiple assistant applications may interact with each other. For example, the assistant application that provides translations may provide input to the assistant application that provides contextual assistance, e.g., provide user speech in a language that the second assistant application can parse. In some implementations, assistant applications may interact with each other in a manner similar or different to how the assistant applications interact with human participants in the communication session. For example, the translation assistant application may utilize text-to-speech technology to provide speech output in a target language that is understood by a human participant. In this example, the translation assistant application may provide translated text directly to the second assistant application in addition to, or alternatively to, providing speech. In some implementations, if multiple applications provide similar functionality, the user can indicate a preference for a particular assistant application of the multiple applications. In some implementations, the preference may be an explicitly indicated preference for the particular assistant application. In some implementations, when the users permit use of user data, the preference may be determined as an implicitly indicated preference, e.g., by ranking (or ordering) assistant applications based on various factors, e.g., user selection of the particular assistant application, or user feedback or action based on the assistance provided (e.g., the user chooses an option provided by the particular assistant application and not by other assistant applications, the user provides an indication of user satisfaction, etc.), as determined based on user interaction such as mouse clicks, taps, etc.
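The implicit ranking of assistant applications described above might be sketched as follows; the signal names and weights are assumptions used only for illustration, not values from the disclosure.

```python
# Illustrative sketch only: order assistant applications by implicit preference
# signals (selections, chosen options, positive interactions). The weights and
# the usage_stats structure are assumed placeholders.
def rank_assistants(assistant_ids, usage_stats):
    """Return assistant identifiers ordered from most to least preferred."""
    def score(assistant_id):
        stats = usage_stats.get(assistant_id, {})
        return (3.0 * stats.get("explicit_selections", 0)
                + 2.0 * stats.get("options_chosen", 0)
                + 1.0 * stats.get("positive_interactions", 0))

    return sorted(assistant_ids, key=score, reverse=True)
```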

In some implementations, a provider of the communication application, e.g., that operates server device 104, may provide an assistant application. In some implementations, a third party different from the participant users and the provider of the communication application, e.g., a third party that operates server device 142, may provide an assistant application. In implementing the communication session with assistance, access to user data including media content exchanged during the communication session and other user data is provided to the assistant application upon specific permission from participant users. If multiple assistant applications are available, participant users are provided with options to select one or more particular assistant applications. Further, users are provided with options to control the user data that each assistant application is permitted to access, including options to deny access to user data to particular assistant applications.

In some implementations, client device 120 may include one or more other applications 154. For example, other applications 154 may be applications that provide various types of functionality, e.g., calendar, address book, e-mail, web browser, shopping, transportation (e.g., taxi, train, airline reservations, etc.), entertainment (e.g., a music player, a video player, a gaming application, etc.), social networking (e.g., sharing images/video, etc.), and so on. In some implementations, one or more of other applications 154 may be standalone applications that execute on client device 120. In some implementations, one or more of other applications 154 may access a server system that provides data and/or functionality of applications 154.

A user interface on a client device 120, 122, 124, and/or 126 can enable display of user content and other content, including images, video, data, and other content as well as communications, privacy settings, notifications, and other data. Such a user interface can be displayed using software on the client device, software on the server device, and/or a combination of client software and server software executing on server device 104 and/or server device 142, e.g., application software or client software in communication with server system 102 and/or server system 140. The user interface can be displayed by a display device of a client device or server device, e.g., a touchscreen or other display screen, projector, etc. In some implementations, application programs running on a server system can communicate with a client device to receive user input at the client device and to output data such as visual data, audio data, etc. at the client device.

In some implementations, any of server system 102 and/or one or more client devices 120-126 can provide a communication application or communication program. The communication program may allow a system (e.g., client device or server system) to provide options for communicating with other devices. The communication program can provide one or more associated user interfaces that are displayed on a display device associated with the server system or client device. The user interface may provide various options to a user to select communication modes, users or devices with which to communicate, e.g., initiate or conduct a communication session, etc.

FIG. 2A is a block diagram illustrating an example configuration 200 in which a communication session with assistance may be provided, according to some implementations. In the example scenario illustrated in FIG. 2A, client device 120 and client device 122 are engaged in a communication session 202 that enables users U1 and U2 to communicate with each other, e.g., exchange media content such as audio and/or video. For example, user U1 may utilize a camera of client device 120 to capture video and a microphone of client device 120 to capture audio. The captured video and/or audio may be transmitted to client device 122 in the communication session 202. Communication session 202 is provided between client device 120 and client device 122 in a direct manner, such that server device 104 does not mediate the session. For example, communication session 202 may be conducted even when server device 104 is absent. Client devices 120 and 122 can communicate with each other via network 130 or can communicate directly (e.g., in a peer-to-peer manner). Communication applications in respective client devices 120 and 122 are configured to provide functionality to enable users U1 and U2 to exchange video, audio, other media, and/or text, in communication session 202. In the example illustrated in FIG. 2A, neither of client devices 120 and 122 is configured with an assistant application.

In implementations where participants in a communication session, e.g., users U1 and U2 in communication session 202, provide consent for automatic assistance, media content from the communication session 202 may be sent to server device 104 that includes assistant application 158 b. For example, participants may be provided with options regarding media content to send to the server device 104, e.g., each participant can choose to provide audio only, video only, audio and video, etc. from the communication session to server device 104. Provision of media content to server device 104 is restricted in such a manner that server device 104 can utilize the media content in assistant application 158 b. Other applications on server device 104 are denied access to the media content, or are provided access to the media content upon specific permission from respective users. If a participant denies use of their media content, such content is not provided to the server device 104. In various implementations, media content from respective client devices is encrypted such that it is readable by assistant application 158 b.
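A minimal sketch of the consent-gated forwarding described above might look like this; the stream types, consent record, and upload call are assumed placeholders used only to illustrate the gating.

```python
# Illustrative sketch only: forward a participant's media to the assistant
# server only for the stream types that participant has consented to share.
# The consent structure and upload call are assumed placeholders.
def forward_media_to_assistant(participant, frames, consent, assistant_endpoint):
    """Send only consented media streams from one participant to the assistant."""
    allowed = consent.get(participant.id, set())   # e.g., {"audio"} or {"audio", "video"}
    for frame in frames:
        if frame.stream_type not in allowed:
            continue                                # consent not given for this stream
        assistant_endpoint.upload(participant.id, frame)
```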

Assistant application 158 b on server device 104 may analyze media content from the communication session to identify one or more information items to be provided during the communication session. When such information items are identified, server device 104 may send the information items and a command to display the information item(s) to one or more of client devices 120 and 122. In some implementations, client devices 120 and 122 may display the information item(s) received from server device 104. In different implementations, an information item may be text, image(s), audio, video, web page(s), software application(s), etc. The command from the server may cause the client device to display a user interface that includes the information item(s), e.g., provide text, image, video, or web page(s) on a screen of the client device, play audio via a speaker or other available audio device, display a user interface of the software application on the client device, etc.
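A display command of the kind described above might be serialized roughly as follows; the field names and transport call are assumptions, not details from the disclosure.

```python
# Illustrative sketch only: build and send a command instructing client devices
# to display an information item. Field names and the connection API are assumed.
import json


def send_display_command(item, client_connections):
    """Send a display command for an information item to each client device."""
    command = json.dumps({
        "command": "display_information_item",
        "item_type": item["type"],    # e.g., "text", "image", "audio", "video", "web_page"
        "payload": item["payload"],   # text body, media URL, or application identifier
    })
    for connection in client_connections:
        connection.send(command)      # deliver to each participating client device
```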

While FIG. 2A illustrates server device 104 that provides assistant application 158 b, in some implementations, server device 142 and assistant application 158 c may be used alternatively or in addition to server device 104 and assistant application 158 b. For example, users that participate in a communication session may be provided with options, e.g., conduct the session without use of an assistant application, use assistant application 158 b, use assistant application 158 c, use both assistant application 158 b and assistant application 158 c, etc. If users select to conduct the communication session without use of an assistant application, media content from the communication session is not sent to a server. If users select one or more of assistant applications 158 b and 158 c, media content from the communication session is provided to the selected assistant applications. In some implementations, it is possible to have multiple assistant applications. In the case of multiple assistant applications, the user can select one or more of the multiple assistant applications. Users are also provided options to opt out from assistance. Further, users are provided with options to disable individual assistant applications or disable assistance features entirely.

FIG. 2B is a block diagram illustrating another example configuration 220 in which a communication session with assistance may be provided, according to some implementations. In the example scenario illustrated in FIG. 2B, client device 120 and client device 122 are engaged in a communication session 222 that enables users U1 and U2 to communicate with each other, e.g., exchange media content such as audio and/or video. For example, user U1 may utilize a camera of client device 120 to capture video and a microphone of client device 120 to capture audio. The captured video and/or audio may be transmitted to client device 122 in the communication session 222.

Communication session 222 is provided between client device 120 and client device 122 in a direct manner, e.g., without use of a communication application on a server system. Client devices 120 and 122 can communicate with each other via network 130 or can communicate directly (e.g., in a peer-to-peer manner). Communication applications in respective client devices 120 and 122 are configured to provide functionality to enable users U1 and U2 to exchange video, audio, other media, and/or text, in communication session 222.

In the example illustrated in FIG. 2B, one or more of client devices 120 and 122 is configured with an assistant application, e.g., assistant application 158 a. In some implementations, assistant application 158 a may be part of communication application 152 a. In some implementations, assistant application 158 a may be a standalone application distinct from communication application 152 a. In some implementations, assistant application 158 a may be part of an operating system of client device 120. In some implementations, assistant application 158 a may be implemented in a modular manner, such that a portion of the assistant application is part of communication application 152 a, one or more other applications 154, an operating system of client device 120, etc. In the scenario illustrated in FIG. 2B, the assistant application is a local application that executes on a client device. In some implementations, different client devices may be configured with different assistant applications, and participants in a communication session may be provided with options to choose a particular assistant application, use multiple assistant applications, etc.

During communication session 222, when users consent to use of assistant application 158 a, assistant application 158 a may receive and analyze media content from the communication session 222. For example, assistant application 158 a may receive media content similar to server assistant application 158 b, as described above. Assistant application 158 a may determine one or more information items, e.g., information item 224, and provide it in the communication session 222, similar to information item 208 as described above.

FIG. 2C is a block diagram illustrating another example configuration 230 in which a communication session with assistance may be provided, according to some implementations. In the example scenario illustrated in FIG. 2C, client device 120, client device 122, and client device 124 are engaged in a communication session that enables users U1, U2, and U3 to communicate with each other, e.g., exchange media content such as audio and/or video.

As illustrated in FIG. 2C, media content from the communication session between client devices 120, 122, and 124 is mediated by server device 104. The communication session is provided by communication application 152 b on server device 104. Communication application 152 b may coordinate exchange of media content between various participants, e.g., users U1, U2, and U3, in the communication session, by receiving media content from each client device and transmitting the received media content to other client devices that are in the communication session.

In some implementations, one or more of client devices 120, 122, and 124 may also be configured with a communication application 152 a. In these implementations, communication application 152 a may be a client application that enables each respective client device to transmit and receive media content for the communication session. In some implementations, the client-side application may be omitted, e.g., the communication session may be provided on a webpage in a browser such that client devices need not be configured with communication application 152 a to participate in the communication session.

Respective media content captured by each of client devices 120, 122, and 124 is sent to server device 104. Communication application 152 b on server device 104 transmits respective media content received from each client device to other client devices that participate in the session. In some implementations, media content is transmitted in the form of an audio and/or video stream. The audio/video stream sent by server device 104 to each client device may include media content received from other client devices in the communication session.
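The server-mediated relay described above can be sketched roughly as follows; the session object and connection interface are illustrative assumptions.

```python
# Illustrative sketch only: a server-side relay loop that forwards each
# participant's media frames to every other participant in the session.
# The session and connection interfaces are assumed placeholders.
def relay_media(session):
    """Forward incoming media frames from each client to all other clients."""
    for sender_id, frame in session.incoming_frames():
        for receiver_id, connection in session.connections.items():
            if receiver_id == sender_id:
                continue                 # do not echo a stream back to its sender
            connection.send_frame(frame)
```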

In the example illustrated in FIG. 2C, server device 104 is also configured with assistant application 158 b. If users that participate in a communication session, e.g., users U1, U2, and U3, choose to enable assistant application 158 b, server device 104 may provide respective media content received from each device to assistant application 158 b. While FIG. 2C illustrates communication application 152 b and assistant application 158 b as part of server device 104, in some implementations, an assistant application provided by a different server device, e.g., assistant application 158 c provided by server device 142, may be used to provide assistance in the communication session, alternatively or in addition to assistant application 158 b. In some implementations, assistant application 158 b may be a part of communication application 152 b. In some implementations, assistant application 158 b may be separate from communication application 152 b. In some implementations, different server devices may be configured with different assistant applications, and participants in a communication session may be provided with options to choose a particular assistant application, use multiple assistant applications, etc.

During the communication session, when users consent to use of assistant application 158 b, assistant application 158 b may receive and analyze media content. Assistant application 158 b may determine one or more information items, e.g., information item 238, and provide the information item in the communication session, similar to information item 208 as described above.

In different implementations, any combination of one or more client assistant applications 158 a, and server assistant applications 158 b and 158 c, may be provided to determine and provide information items in a communication session. Different implementations may provide different technical benefits.

For example, the configuration illustrated in FIG. 2A may be advantageous, e.g., due to separation of the communication session from the provision of the assistant application. For example, by sending media content from a client device in the communication session 202 in parallel to other client devices in the communication session and server device 104, provision of assistance is separated from exchange of media content between participants. This may be beneficial, since delays in communications to and from server device 104 do not affect communication between participants. Further, if participants in communication session 202 choose to disable assistance features, transmission of media content 204 and 206 to server device 104 can be stopped, resulting in bandwidth savings. If participants choose to enable assistance features, transmission of media content 204 and 206 may be resumed. Further, assistant application 158 b on server device 104 may benefit from the greater computational resources available on server device 104, in comparison to client devices 120-124. In implementations where users permit use of user data for assistance, server device 104 may access database 106 to retrieve user data for use by assistant application 158 b. This may be advantageous, e.g., when users store user data such as photos/videos, calendar, documents, etc. on server device 104. In this configuration, assistant application 158 b may also be updated, e.g., as new assistant features are developed, without need to update client devices 120-124.

In some implementations, the configuration illustrated in FIG. 2B may require less bandwidth, since there is no parallel transmission of media content to a server device. Further, when users provide permission to access and utilize user data for assistance features, an assistant application 158 a on client device 120 may be able to retrieve local user data quickly and use such data to determine information item 224. Further, when users permit assistant application 158 a to access contextual information, e.g., user location, recent user activity on client device 120, etc., assistant application 158 a can conveniently retrieve such data locally.

In some implementations, the configuration illustrated in FIG. 2C may provide several advantages. The configuration in FIG. 2C utilizes server device 104 to provide the communication session and the assistant application. In this configuration, media content is available on server device 104, and may directly be utilized by assistant application 158 b when users choose to enable assistance features.

While FIG. 2A and FIG. 2B illustrate a communication session that includes two client devices, and FIG. 2C illustrates a communication session that includes three client devices, it will be understood that assistance may be provided in a communication session that includes any number of client devices. Further, while each of client devices 120-124 is illustrated as being associated with a single user from users U1-U4, it may be understood that a client device in a communication session may be associated with multiple participants. For example, if a client device is a video conferencing system in a meeting room, the client device may be determined as associated with a plurality of participant users that are present in the meeting room. In another example, if two different users are in a field of view of a camera or are detected as speaking by a microphone of the client device, the client device may determine that there are two participants associated with the client device. For example, a television or home speaker device may be used for a family communication session, where multiple family members in a room where the television or home speaker device is located participate in the communication session. When a plurality of participants is associated with a client device, assistance features may be enabled or disabled, e.g., by an administrator user of the device, by consent of all participants, etc.

FIG. 3 is a flow diagram illustrating an example method 300 to provide assistance during a communication session, according to some implementations. In some implementations, method 300 can be implemented, for example, on a server system 102 as shown in FIG. 1. In some implementations, some or all of the method 300 can be implemented on one or more client devices 120, 122, 124, or 126 as shown in FIG. 1, one or more server devices, and/or on both server device(s) and client device(s). In described examples, the implementing system includes one or more digital processors or processing circuitry (“processors”), and one or more storage devices (e.g., a database 106 or other storage). In some implementations, different components of one or more servers and/or clients can perform different blocks or other parts of the method 300. In some examples, a first device is described as performing blocks of method 300. Some implementations can have one or more blocks of method 300 performed by one or more other devices (e.g., other client devices or server devices) that can send results or data to the first device.

In some implementations, the method 300, or portions of the method, can be initiated automatically by a system. In some implementations, the implementing system is a first device. For example, the method (or portions thereof) can be periodically performed, or performed based on one or more particular events or conditions, e.g., a communication session being initiated by a user, a user joining a communication session in progress, a user answering a request for a communication session, and/or one or more other conditions occurring which can be specified in settings read by the method. In some implementations, such conditions can be specified by a user in stored custom preferences of the user.

In some examples, the first device can be a camera, cell phone, smartphone, tablet computer, wearable device, television, set top box, home speaker, or other client device that can initiate or join a communication session based on user input by a user to the client device, and can perform the method 300. Some implementations can initiate method 300 based on user input. A user (e.g., operator or end-user) may, for example, have selected the initiation of the method 300 from a displayed user interface.

An image as referred to herein can include a digital image having pixels with one or more pixel values (e.g., color values, brightness values, etc.). An image can be a still image (e.g., still photos, images with a single frame, etc.), a dynamic image (e.g., animations, animated GIFs, cinemagraphs where a portion of the image includes motion while other portions are static, etc.), or a video (e.g., a sequence of images or image frames that may include audio). While the remainder of this document refers to an image as a static image, it may be understood that the techniques described herein are applicable to dynamic images, video, etc. For example, implementations described herein can be used with still images (e.g., a photograph, an emoji, or other image), videos, or dynamic images. Text, as referred to herein, can include alphanumeric characters, emojis, symbols, or other characters.

In block 302, it is determined that a communication session is initiated. For example, a user may provide user input to initiate a communication session, e.g., an audio call, a video call, a messaging session, etc., and identify one or more other users that participate in the communication session. In another example, a communication session may be initiated automatically by a client device, e.g., at a scheduled time. In some implementations, a client device may determine that a communication session is initiated based on user input responding to an incoming request for a communication session, e.g., answering an incoming audio or video call. In some implementations, determining that a communication session is initiated may include determining that a communication session is in progress, and that a device that implements method 300 has joined the communication session in progress. In some implementations, determining that the communication session is initiated includes determining identities (e.g., user names, telephone numbers, email addresses, social media handles, etc.) of users that participate in the communication session. The method proceeds to block 312.

In block 312, it is checked whether user consent (e.g., user permission) has been obtained to use user data in the implementation of method 300. For example, user data can include media content sent or received by a user in a communication session, e.g., audio, video, etc., user preferences, user biometric information, user characteristics (identity, name, age, gender, profession, etc.), information about a user’s social network and contacts, social and other types of actions and activities, content, ratings, and opinions created or submitted by a user, a user’s current location, historical user data, images generated, received, and/or accessed by a user, videos viewed or shared by a user, a user’s calendar or schedule, etc. One or more blocks of the methods described herein may use such user data in some implementations only upon specific consent from the user. User data for which the user has not provided consent is not used.

In some implementations, user consent is obtained from each client device that participates in the communication session. For example, if two users initiate an audio call, consent is obtained at each client device. In some implementations, e.g., when one or more of the client devices is a videoconferencing system, or a television, consent may be obtained for the user identity associated with the client device, e.g., an administrator user. In some implementations, user consent may be determined based on settings associated with the communication session. For example, if the communication session is a virtual meeting that makes use of client devices provided by an employer, it may be determined that each user has provided consent. In still other implementations, an organizer of the session, e.g., a teacher in a virtual classroom, may provide consent. In some implementations, one or more of the users that participate in a communication session may choose to decline consent for use of user data. User data of such users is not used in implementing method 300.

In some implementations, a permission command may be sent to one or more of the computing devices (e.g., client devices 120-124) that participate in a communication session. The permission command may be configured to cause the computing devices to display a permission user interface, e.g., on a screen of the client device, via an audio prompt, etc. The permission user interface enables respective users of the computing devices (e.g., users U1-U4 of client devices 120-124) to provide respective permission indications. For example, in some implementations, e.g., the configurations illustrated in FIGS. 2A and 2C, a server device, e.g., server device 104, may send the permission command to the computing devices. In some implementations, e.g., the configuration illustrated in FIG. 2B, client device 120 may send the permission command to other client devices that are in the communication session.

Respective users of the client devices that receive the permission command may provide respective permission indications, indicating respective user permissions for user data, e.g., media content from the client device, to be used to implement method 300. For example, each requested user may grant permission for use of user data, e.g., media content. In another example, one or more users may decline permission. In some implementations, users may grant permissions selectively. For example, a user may grant permission for use of certain user data, e.g., images, videos, and calendar, and deny permission for use of other user data. In these implementations, only such data for which the user has provided permissions is utilized in the implementation of method 300. In some implementations, a user may provide permission for use of media content generated by a client device of the user, e.g., audio and/or video for the communication session, and decline use of other user data. In these implementations, only the media content generated during the communication session is utilized in the implementation of method 300. The permission indications are sent to the device (e.g., server device 104 or client device 120) that sent the permission command. Based on the respective permission indications, it may be determined whether all users have provided permission for use of user data, e.g., for use of the media content by a device that implements method 300.
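A rough sketch of collecting permission indications and deciding whether media content may be received could look like the following; the message shapes and device interface are assumptions used only for illustration.

```python
# Illustrative sketch only: send a permission command to each device, gather
# the permission indications, and proceed only if every user granted the
# "media_content" permission. Message fields and device calls are assumed.
def collect_permissions(devices):
    """Return True only if every participating user permits use of media content."""
    for device in devices:
        device.send({"command": "request_permission", "scopes": ["media_content"]})

    indications = [device.wait_for_permission_indication() for device in devices]
    return all("media_content" in indication.get("granted", []) for indication in indications)


def maybe_receive_media(devices, session):
    if collect_permissions(devices):
        return session.receive_media()   # proceed to receive media content
    return None                          # assistance proceeds without media content
```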

If user consent has been obtained from the relevant users for which user data may be used in the method 300, then in block 314, it is determined that the blocks of the methods herein can be implemented with possible use of user data as described for those blocks, and the method continues to block 320. If user consent has not been obtained, it is determined in block 316 that blocks are to be implemented without use of user data, and the method continues to block 320. In some implementations, if user consent has not been obtained, blocks are implemented without use of user data and with generic or publicly-accessible and publicly-usable data.

In some implementations, it is determined if each user in the communication session provided permission for use of user data, including media content from the respective client device. In some implementations, if any of the users decline permission for use of their user data, the communication session may be conducted without use of user data. For example, assistance features may be turned off during the communication session. In some implementations, assistance features may be enabled based on user data only from those users that provided permission for use of user data. In some implementations, users may be provided with options to change their permission at any time during the communication session. When a user changes user permission to not allow use of user data, use of such data is immediately ceased.

In block 320, it is determined if participants in a communication session have consented to assistance, e.g., assistance provided by a system that implements method 300, during the communication session. In some implementations, if it is determined in block 312 that all users have provided user permission for use of user data, it may be determined that the users have consented to assistance and the method proceeds to block 322. In some implementations, e.g., when users have provided permission for use of their media content generated during the communication session, the media content is used to implement method 300. If the participants have provided consent to assistance during the communication session, the method proceeds to block 322. If it is determined that the participants have not provided consent, the method proceeds to block 340.

In block 322, media content from the communication session is received. In particular, the received media content may include audio, video, and/or text generated or provided in the communication session from various client devices that participate in the communication session. If users of one or more client devices do not provide user permissions to receive media content, such media content is not received. In some implementations, the communication session is a synchronous communication session in which participants are present at the same time and provide respective media content, e.g., audio, video, and/or text. For example, a synchronous communication session may include an audio call, e.g., a telephone call, a call using voice-over-IP technology, etc.; a video call, e.g., a call that includes both audio and video from one or more participants; a messaging session where different participants exchange text messages synchronously, etc. The method proceeds to block 324.
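The following is a minimal sketch of how a device implementing block 322 might collect media content only from consenting participants. The class and field names are illustrative assumptions, not structures defined in this specification.

```python
# Hypothetical sketch of block 322: accept media chunks only from participants
# whose users have granted permission for their media to be used.
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class MediaChunk:
    device_id: str
    kind: str                 # "audio", "video", or "text"
    payload: bytes
    timestamp_ms: int


@dataclass
class SessionState:
    # device_id -> whether that user granted permission for media use
    permissions: Dict[str, bool] = field(default_factory=dict)
    received: List[MediaChunk] = field(default_factory=list)


def receive_media(session: SessionState, chunk: MediaChunk) -> Optional[MediaChunk]:
    """Store a media chunk only if the sending user granted permission."""
    if not session.permissions.get(chunk.device_id, False):
        return None              # media from non-consenting users is not received
    session.received.append(chunk)
    return chunk
```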

In block 324, one or more information items are determined based on the received media content. Determination of information items is explained with reference to FIG. 4. The method proceeds to block 326.

In block 326, the information item is displayed. In some implementations, the information item is displayed concurrently by each client device that participates in the communication session. In some implementations, a command is sent by a device that implements method 300 to the client devices in the communication session to display the information item. In some implementations, the command may cause each of the client devices in the communication session to retrieve the information item. For example, an information item may be a publicly available information item on the Internet, e.g., a map, information about a business such as a restaurant, a publicly available video, etc.

In some implementations, access to the information item may be restricted to specific users. For example, the information item may be a document, an image, or other content available to specific user accounts. In some implementations, the information item may be stored in a server system, e.g., in database 106. In some implementations, the information item may be stored in local storage of one or more of the computing devices in the communication session, e.g., client devices 120-124.

In some implementations, each client device that receives the command to display the information item retrieves the information item. In some implementations, a device that implements method 300 or any of the participating computing devices may send the information item to the computing devices in the communication session. For example, if the information item is stored in database 106, server device 104 may send the information item to computing devices in the communication session. In some implementations, the information item may be sent selectively to only those devices that request the information item from the server or another computing device in the communication session.

In some implementations, e.g., when the information item includes audio, displaying the information item includes playback of the audio. In some implementations, e.g., when the information item includes video, displaying the information item includes playback of the video. In some implementations, the information item may be an interactive application, e.g., an application that executes within a communication application that provides the communication session. In these implementations, displaying the information item includes displaying an interactive user interface of the interactive application. For example, the interactive application may be a document editing application, a game application, etc. In some implementations, the information item may be displayed such that it occupies a majority of the screen of computing devices that participate in the session, and media content, e.g., video from the communication session, is displayed in a smaller size. FIG. 8 illustrates an interactive quiz application that is displayed in this manner. Displaying the interactive application in this manner may be advantageous, e.g., a participant user can modify a document in a document editing application while the communication session is in progress.

In some implementations, the information item may be based on the received media content. For example, the information item may include a text transcript of an audio portion of the received media content, a translation of the audio portion of the received media content to a different language, etc. In some implementations, the information item includes an augmentation or altered version of the received media content. For example, the information item may be an illustration to be displayed atop a video portion of the received media content. For example, if a participant in the communication session utters a phrase “Happy Valentine’s Day!” or “I love you,” the illustration may include hearts or balloons displayed atop a video portion. In other examples, for a birthday, balloons, confetti, or other birthday-related graphics or phrases may be displayed. In yet another example, illustrations or phrases for achievement in a competition can be displayed, such as trophies, a phrase such as “Congratulations!” or the like. While specific phrases are listed here, the assistant application may utilize techniques to determine equivalent phrases or emotions from speech and/or video during the communication session, e.g., by using a machine-learned model to determine such context. In another example, the information item may include audio received from a particular user, rendered in a different voice and/or accent, e.g., a celebrity’s voice or accent. Some implementations can include context-based (or interest-based) information retrieval, when users permit determination of context. For example, such information may include hotel prices if it is detected that the conversation is about vacation, or traffic conditions if it is detected that the conversation is about an in-person meeting in the near future, e.g., in an hour. Some implementations can also include notifying both users of events or topics of interest to both users; for example, if it is determined that both users are soccer fans, based on conversation context, the users may be notified of important developments, e.g., goals scored, in a soccer game of the users’ interest.
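As one hedged illustration of the augmentation described above, a simple mapping from detected phrases to overlay graphics could look like the following. The phrase lists and overlay identifiers are assumptions for the example; an implementation could instead use a machine-learned model to detect equivalent phrases.

```python
# Illustrative sketch: choose a video overlay based on a transcribed phrase.
from typing import Optional

PHRASE_OVERLAYS = {
    ("happy valentine's day", "i love you"): "hearts",
    ("happy birthday",): "balloons_and_confetti",
    ("congratulations", "we won"): "trophy",
}


def overlay_for_phrase(transcript: str) -> Optional[str]:
    """Return an overlay identifier if the transcript matches a known phrase."""
    text = transcript.lower()
    for phrases, overlay in PHRASE_OVERLAYS.items():
        if any(p in text for p in phrases):
            return overlay
    return None   # no augmentation; an ML model could detect equivalent phrases
```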

In some implementations, e.g., in which the media content for the communication session includes a video from one or more of the participant users, the first command causes display of the information item such that a face or other content of the video is not obscured by display of the information item. For example, the received media content may be analyzed to determine whether a face or other content is present that is not to be obscured. Analysis of media content is described further with reference to FIG. 4. Face detection techniques may be utilized in such analysis. For example, a plurality of video frames in video received from participant users may be analyzed to determine a position of the face within the video content, and to identify positions in the frames that do not include the face and that may be suitable to display the information item. Detection of faces is performed when respective participant users have provided consent. When one or more faces are detected, the detected positions of the faces are used to provide the first command, e.g., the information item is displayed such that a position of the information item does not obscure the one or more faces. In some implementations, when a user interacts with a user interface provided by an assistant application, e.g., scrolls the user interface, increases a size of a user interface window, or views items in a summary of the conversation in the communication session, positions and/or sizes of faces in video of the communication session are adjusted such that faces are not obscured by the user interface provided by the assistant application.
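A minimal sketch of the placement logic, assuming face bounding boxes have already been obtained from a face detector, might be as follows. The candidate anchor positions and rectangle representation are assumptions for illustration only.

```python
# Sketch: position an information item so it does not overlap detected faces.
# Rectangles are (x, y, w, h) in pixels.
from typing import List, Optional, Tuple

Rect = Tuple[int, int, int, int]


def overlaps(a: Rect, b: Rect) -> bool:
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def place_item(frame_w: int, frame_h: int, item_w: int, item_h: int,
               faces: List[Rect]) -> Optional[Rect]:
    """Return a rectangle for the information item that avoids all faces."""
    candidates = [
        (0, frame_h - item_h),                      # bottom-left
        (frame_w - item_w, frame_h - item_h),       # bottom-right
        (0, 0),                                     # top-left
        (frame_w - item_w, 0),                      # top-right
    ]
    for x, y in candidates:
        rect = (x, y, item_w, item_h)
        if not any(overlaps(rect, face) for face in faces):
            return rect
    return None   # no free region; caller may shrink or reposition the video instead
```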

Other content from the video, e.g., one or more other faces, may also be determined. In some implementations, one or more visual characteristics of the user interface for the information item are selected such that the information item is easily visible to participants of the communication session. For example, if video displayed on a computing device, e.g., received from other participant users, is detected as having a dark or black background, the user interface may be displayed with text of the information item in white or a light color that provides contrast from the background. In another example, a plain background color (e.g., white) may be used for the user interface and the information items may be displayed atop the background such that the information items are easily visible. In some implementations, the user interface includes only the information items, e.g., displayed directly with video from the communication session, without other elements such as background, window borders, etc. In some implementations, the user interface is generated and displayed separate from video and/or images received as part of the communication session. In these implementations, the user interface is not part of a video feed exchanged between participant computing devices in the communication session, and is displayed as an overlay on the video feed. For example, the video feed may be exchanged directly between two client devices that participate in a communication session, while the user interface may be provided by a different computing device, e.g., server device 104, that implements an assistant application.

In some implementations, the command to display the information item may cause the information item to be displayed at a particular position relative to the face in the user interface such that the face is not obscured. In some implementations, the command to display the information item may cause the video from one or more participant users to be adjusted, e.g., shrunk in display size, repositioned in the user interface, etc., based on the type and/or display size of the information item. For example, if the information item is a document, one or more of the videos from participant users may be minimized, e.g., shrunk in size, or moved to a different location, e.g., to a top or bottom of the user interface, to one side of the user interface, etc. In some implementations, display of video may be turned off, e.g., temporarily, by the command to cause display of the information item. The method proceeds to block 328.

In block 328, it is determined whether the communication session is to be terminated, e.g., based on user input. If it is determined that termination input has been received and that the communication session is to be terminated, the method proceeds to block 346. If it is determined that termination input has not been received, the method proceeds to block 330.

In block 330, it is determined if there is user input indicative of a request to stop assistance during the communication session. For example, one or more users that participate in the communication session may provide input, e.g., a user command, to turn off assistance. In some implementations, a visual indicator that indicates an active or inactive status of assistance, e.g., that an assistant application or assistant program that provides assistance is active or inactive, may be provided in a user interface of the communication application. In some implementations, the visual indicator may be selectable. Upon user selection of the visual indicator that indicates active status of an assistant application (e.g., indicating that the user has provided a user command to turn off assistance), assistance is turned off, e.g., the assistant application or assistant program is disabled, terminated, denied access to media content from the session, etc. In some implementations, e.g., when the assistant application or assistant program is executed by a server device 104 that is not a participant in the communication session, the server device that provides the assistant application may be removed, e.g., disconnected, from the communication session. In some implementations, users may provide user input to turn off assistance, e.g., by tapping, clicking on, or otherwise selecting the visual indicator. In some implementations, users may provide voice input or speak a command to turn off assistance. When assistance is turned off, media content from the communication session, e.g., audio, video, or text, exchanged between participant users, is not available to the assistant application. In some implementations, some users in the communication session may be restricted from providing input to turn off assistance. In these implementations, for the users who are restricted from providing input to turn off assistance, the visual indicator is not selectable. For example, during a communication session that is a job interview, the person who initiates or manages the session, e.g., an interviewer, may selectively enable assistant restrictions, e.g., one-way assistance or assistance limited to certain features such as calendar. If one or more users in the communication session provide input to turn off assistance, the method proceeds to block 340. Else, the method proceeds to block 322 to receive subsequent media content in the communication session.

In block 340, the communication session is conducted without assistance. For example, if participant users in the communication session do not provide consent for assistance in block 320, or if one or more participant users provide user input to stop assistance, block 340 may be performed. When assistance is turned off, media content is not provided to an assistant application, e.g., assistant applications 158a, 158b, and 158c. The method proceeds to block 342.

In block 342, one or more users may provide user input to start assistance, e.g., by selecting the visual indicator that indicates the status of assistance. In response to the user input, it is determined whether the participant users have consented to assistance. In some implementations, a user interface may be displayed to those participant users that have not provided consent for assistance so that their consent is obtained; the user input to start assistance includes such consent input. When one or more participant users decline consent for assistance, it is determined that assistance cannot be started and the method proceeds to block 344. If user input is received to start assistance and participant users consent to start assistance, the method proceeds to block 322.

In block 344, it is determined whether the communication session is to be terminated, e.g., based on user input. If it is determined that termination input has been received and that the communication session is to be terminated, the method proceeds to block 346. If it is determined that termination input has not been received, the method proceeds to block 340.

In block 346, the communication session is terminated. For example, a server device 104 that hosts the communication session, e.g., as illustrated in FIG. 2C, may terminate the communication session. In another example, e.g., when client devices engage directly in a communication session, as illustrated in FIGS. 2A and 2B, one or more of the client devices may terminate the communication session. In some implementations, upon termination of the communication session, a session summary is provided. For example, the session summary may include a text transcript of one or more portions of media content received during the communication session and one or more information items provided during the communication session. An example session summary is illustrated in FIG. 7. In some implementations, the session summary can also include one or more follow-up actions, e.g., a proposal to set up a reminder or calendar event for an item mentioned in the conversation.

FIG. 4 is a flow diagram illustrating an example method 400 to provide information items in a communication session, according to some implementations. In some implementations, method 400 can be implemented, for example, on a server system 102 as shown in FIG. 1. In some implementations, some or all of the method 400 can be implemented on one or more client devices 120, 122, 124, or 126 as shown in FIG. 1, on one or more server devices, and/or on both server device(s) and client device(s). In described examples, the implementing system includes one or more digital processors or processing circuitry (“processors”), and one or more storage devices (e.g., a database 106 or other storage). In some implementations, different components of one or more servers and/or clients can perform different blocks or other parts of the method 400. In some examples, a second device is described as performing blocks of method 400. Some implementations can have one or more blocks of method 400 performed by one or more other devices (e.g., other client devices or server devices) that can send results or data to the second device.

In some implementations, the method 400, or portions of the method, can be initiated automatically by a system. In some implementations, the implementing system is a second device. For example, the method (or portions thereof) can be periodically performed, or performed based on one or more particular events or conditions, e.g., media content being received during a communication session.

In one example, the second device can be a camera, cell phone, smartphone, tablet computer, wearable device, television, set top box, home speaker, or other client device that can engage in a communication session based on user input by a user to a client device, and can perform the method 400.

In block 402, it is checked whether user consent (e.g., user permission) has been obtained to use user data in the implementation of method 400. For example, user data can include media content sent or received by a user in a communication session, e.g., audio, video, etc., user preferences, user biometric information, user characteristics (identity, name, age, gender, profession, etc.), information about a user’s social network and contacts, social and other types of actions and activities, content, ratings, and opinions created or submitted by a user, a user’s current location, historical user data, images generated, received, and/or accessed by a user, videos viewed or shared by a user, a user’s calendar or schedule, etc. One or more blocks of the methods described herein may use such user data in some implementations only upon specific consent from the user. User data for which the user has not provided consent is not used.

In some implementations, user consent is obtained from each client device that participates in the communication session. For example, if two users initiate an audio call, consent is obtained at each client device. In some implementations, e.g., when one or more of the client devices is a videoconferencing system or a television, consent may be obtained for the user identity associated with the client device, e.g., an administrator user. In some implementations, user consent may be determined based on settings associated with the communication session. For example, if the communication session is a virtual meeting that makes use of client devices provided by an employer, it may be determined that each user has provided consent. In still other implementations, an organizer of the session, e.g., a teacher in a virtual classroom, may provide consent. In some implementations, one or more of the users that participate in a communication session may choose to decline consent for use of user data. User data of such users is not used in implementing method 400.

If user consent has been obtained from the relevant users for which user data may be used in the method 400, then in block 404, it is determined that the blocks of the methods herein can be implemented with possible use of user data as described for those blocks, and the method continues to block 410. If user consent has not been obtained, it is determined in block 406 that blocks are to be implemented without use of user data, and the method continues to block 410. In some implementations, if user consent has not been obtained, blocks are to be implemented without use of user data and/or with generic or publicly-accessible and publicly-usable data.

In block 410, received media content is analyzed. For example, media content may be received as described above with reference to block 322 of FIG. 3. In some implementations, media content may include audio, video, and/or text provided by one or more participant users via respective computing devices, e.g., client devices 120-124, in the communication session. Analysis of received media content may be performed using one or more of several different techniques.

In some implementations, the received media content may include audio. In these implementations, speech-to-text techniques may be used to determine the contents of a user’s speech. In some implementations, speech-to-text techniques may utilize a machine-learning application that utilizes a trained model to convert speech to text. In some implementations, the trained model may be implemented with a neural network that includes long short-term memory (LSTM) nodes. In some implementations, speech biasing might be adjusted specifically for an audio call. In some implementations, the trained model might be trained specifically for audio and/or video calls, e.g., the model is trained such that training data used to train the model excludes audio and video data from sources that are not audio or video conversations. The model can be re-trained (or adjusted) on the fly, e.g., during or after a particular conversation. Further, if permitted by the user, parameters of the model can be saved and associated with a user. This can enable the model to be initialized for the particular user with the saved parameters, e.g., when a next conversation that includes the user is started. In some implementations, analysis of audio may include determining a source language spoken by a participant user. In some implementations, the received audio may include music. In these implementations, audio fingerprinting techniques may be used to identify the music, e.g., a song title. In some implementations, analyzing the audio may include determining whether a user has provided a command to explicitly invoke assistance during the communication session.
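As a hedged sketch of the speech biasing described above, a recognizer could be called with phrases already seen in the session so that call-specific vocabulary is favored. The SpeechModel interface below is hypothetical; a real system would use a trained LSTM-based (or similar) recognition model.

```python
# Sketch of call-specific speech-to-text biasing; interface names are assumed.
from typing import List, Protocol


class SpeechModel(Protocol):
    def transcribe(self, audio: bytes, bias_phrases: List[str]) -> str: ...


def transcribe_call_audio(model: SpeechModel, audio: bytes,
                          session_vocabulary: List[str]) -> str:
    """Transcribe call audio, biasing recognition toward terms already seen in
    this conversation (e.g., participant names, topics)."""
    return model.transcribe(audio, bias_phrases=session_vocabulary)
```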

In some implementations, the received media content may include video. In these implementations, video analysis techniques such as face detection, motion detection, gesture detection, etc. may be utilized to analyze received media content. For example, analysis may be performed to determine whether video from a client user includes a gesture that is associated with explicit invocation for assistance during the communication session. In another example, analysis may be performed to determine a position of one or more faces in the video. The position of the one or more faces is utilized, e.g., to generate the first command such that display of information items does not obscure the one or more faces.

In some implementations, the received media content may include text. In some implementations, text analysis may be performed, e.g., using pattern matching, topic detection, etc., to determine whether the text (or text transcribed using speech-to-text techniques) in the communication session includes an explicit invocation for assistance. In some implementations, text analysis may be performed to determine a topic of conversation during the communication session, e.g., ski trip, brunch, etc.

In some implementations, the received media content may include a locally-generated representation of user activity, generated by a respective client device of one or more participant users. For example, in some implementations, one or more client devices may be configured with a machine-learning application that generates the representation of local user activity. For example, the machine-learning application may perform speech-to-text conversion to generate the locally-generated representation in text form, when the local user activity is a user speaking during the communication session. In some implementations, the locally-generated representation may be in a machine-readable form. The locally-generated representation may be smaller in size than audio or video transmitted from the respective client device. Transmitting a locally-generated representation instead of the audio or video may reduce bandwidth requirements while enabling a device that implements method 400 to perform the analysis to determine one or more information items.

In these implementations, a device that implements method 400 may include a machine-learning application that analyzes the locally-generated representation in machine-readable form to draw inferences. For example, the inferences may include determining whether the user speech included an explicit invocation for assistance during the communication session, determining a gender of the speaker, an estimated age of the speaker, whether the speaker is indoor or outdoor, a language that the speaker speaks during the communication session, etc., based on the locally-generated representation. In another example, e.g., when the locally-generated representation is based on video transmitted from the respective client device, the machine-learning application may provide inferences such as whether a face is present in the video, a position of the face, a number of faces in the video generated by the client device, etc. The inferences may be used to determine information items.
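A minimal sketch of such a locally-generated representation, assuming a simple JSON encoding and illustrative field names (none of which are prescribed by the specification), could be:

```python
# Sketch: a compact client-side summary sent in place of raw audio/video.
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class LocalActivitySummary:
    device_id: str
    transcript: Optional[str]       # speech-to-text performed on the client
    face_count: int                 # number of faces detected locally
    language: Optional[str]         # detected spoken language, if any


def encode_for_upload(summary: LocalActivitySummary) -> bytes:
    """Serialize the summary; typically far smaller than the media it describes."""
    return json.dumps(asdict(summary)).encode("utf-8")
```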

In block 412, it is determined whether the received media content included an explicit invocation for assistance during the communication session. For example, a participant user may speak a particular phrase, e.g., “Assistant, show my calendar,” “Assistant, show me on a map,” etc. In this example, the phrase “Assistant” may be an invocation phrase or hotword that indicates explicit invocation. The phrase “Assistant” is one example, and users may select any phrase of their choice to invoke assistance. In some implementations, explicit invocation of assistance features may include a text command (e.g., “@assistant, show my pictures”) or gesture, e.g., a particular gesture associated with invocation of assistance features. If it is determined that the media content included an explicit invocation, the method proceeds to block 414. If it is determined that the media content did not include an explicit invocation, the method proceeds to block 430.
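One way block 412 might detect an explicit invocation in transcribed speech or text is a simple pattern match, as sketched below. The invocation phrase “assistant” is only an example, as noted above, and a deployed system would likely use a trained hotword detector rather than a regular expression.

```python
# Sketch of block 412: detect an explicit invocation and extract the request.
import re
from typing import Optional

INVOCATION_PATTERN = re.compile(r"^\s*@?assistant[,:]?\s+(?P<request>.+)$",
                                re.IGNORECASE)


def parse_explicit_invocation(utterance: str) -> Optional[str]:
    """Return the requested action if the utterance explicitly invokes assistance."""
    match = INVOCATION_PATTERN.match(utterance)
    return match.group("request") if match else None


# e.g., parse_explicit_invocation("Assistant, show my calendar") -> "show my calendar"
```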

In block 430, conversation context in the communication session is determined. For example, if audio in the media content from a first user U1 in the communication session includes a phrase “went skiing in Tahoe,” followed by audio in the media content from a second user U2 in the communication session that includes a phrase “You must have amazing pictures!” and subsequently a phrase “let me show you the photos” from U1, it may be determined that the context includes “skiing,” “Tahoe,” and “pictures.” Further, when user U1 permits access to user data, e.g., photos, it may be determined that the user has recent photos taken at Lake Tahoe. Based on the context and user data that matches the context, it may be determined with a high confidence score that the conversation context indicates implicit invocation. In another example, if a user utters a phrase “I wonder how far restaurant A is,” it may be determined with a high confidence score that the user is interested in knowing the distance to restaurant A from the user’s current location. In yet another example, a discussion with phrases such as “That’s interesting! Tell me more,” or “I don’t know about that” may indicate a conversation where it is unlikely that the user would benefit from assistance and hence, a low confidence score may be associated with invocation.
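A rough sketch of accumulating context terms and scoring implicit invocation, under the example above, might look like the following. The term list and scoring weights are illustrative assumptions, not values defined in this specification.

```python
# Sketch of block 430: accumulate context terms and score implicit invocation.
from typing import Set

CONTEXT_TERMS: Set[str] = {"skiing", "tahoe", "pictures", "photos",
                           "brunch", "restaurant"}


def update_context(context: Set[str], transcript: str) -> Set[str]:
    """Add recognized context terms from a new utterance."""
    words = {w.strip(".,!?").lower() for w in transcript.split()}
    return context | (words & CONTEXT_TERMS)


def implicit_invocation_score(context: Set[str], matching_user_items: int) -> float:
    """Higher when the context is specific and matching user data exists."""
    score = 0.2 * len(context)
    if matching_user_items > 0:
        score += 0.4
    return min(score, 1.0)
```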

In some implementations, when users permit use of user activity and feedback data, past user actions and/or feedback upon a determination of implicit invocation may be utilized to determine the confidence score. For example, if implicit invocation was determined in the past, but the user didn’t interact with information items that were provided as assistance, it may be determined that the conversation context may have been unsuitable to invoke assistance. In another example, if a user provides feedback, e.g., turns off assistance, dismisses information items, etc., it may be determined that the conversation context may have been unsuitable to invoke assistance. In some implementations, users may provide user preferences for implicit invocation, e.g., “do not invoke when I speak with person P,” “always invoke when I speak with person M,” “invoke during scheduled meetings, but not during unscheduled calls,” etc. In some implementations, when users permit use of user interaction data, the interaction data may be utilized to determine suitable contexts to provide assistance. For example, in the case of explicit invocation by a user, the context prior to the explicit invocation can be used as an example of a context where automatic provision of assistance is appropriate.

In some implementations, a request for contextual information is sent to one or more of the computing devices that participate in the communication session. For example, the request may indicate that the computing device is to provide information such as a current location and/or past locations of the computing device, recent user activity on the computing device, user profile information stored locally on the computing device, etc., for determining context during the communication session. The computing device may be configured to determine whether the user has provided permission to provide such information. If the user of the computing device has provided permission, the requested contextual information is sent in response to the request. In some implementations, contextual information from the computing devices in the communication session may assist in determining whether assistance has been invoked. In some implementations, the contextual information may be used to identify one or more information items, as described below. The contextual information from the user device may be beneficial, since it can provide additional contextual signals, in addition to the context determined based on media content received from the device. The method proceeds to block 432.

In block 432, it is determined whether the confidence score determined for implicit invocation meets a threshold. In some implementations, a session-specific confidence score threshold may be set for each communication session. For example, if users permit access to past interaction data, a communication session where participant users are determined to make high use of assistance features may have a relatively low confidence score threshold. In another example, if users in a particular communication session are less likely to make use of assistance, the confidence score threshold may be set at a relatively higher value. In different implementations, the confidence score threshold is set such that users benefit from implicit invocation of assistance. The confidence score threshold may also be adjusted during a communication session, e.g., based on user actions to choose or dismiss assistance notifications during the communication session. In some implementations, the confidence score can be based on the trained model from prior sessions, e.g., that indicates a likelihood that a user is likely to find assistance valuable or not. Further, the confidence score may be adjusted based on a detected topic or context of conversation. For example, assistance features that are fun features (e.g., “hearts,” “balloons,” “face masks,” etc.) may be turned off, e.g., have a low confidence score, based on a determination that a particular conversation is a business conversation. Triggering of assistance may be based on the conversation as a whole, e.g., content in the conversation provided by multiple participants. The confidence score may be determined to meet the threshold based on contribution to the conversation from individual participants and/or interaction between participants. For example, if User1 mentions “weather” and assistance is triggered, assistance is also triggered when User2 subsequently mentions “weather.” The method proceeds to block 414.
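A minimal sketch of a per-session threshold that adapts to whether users accept or dismiss assistance notifications, consistent with the adjustment described above, could be written as follows. The initial value and step sizes are assumptions for illustration only.

```python
# Sketch of block 432: an adaptive, session-specific confidence threshold.
class InvocationGate:
    def __init__(self, initial_threshold: float = 0.6):
        self.threshold = initial_threshold

    def record_feedback(self, accepted: bool) -> None:
        """Lower the bar when suggestions are used, raise it when dismissed."""
        self.threshold += -0.05 if accepted else 0.05
        self.threshold = min(max(self.threshold, 0.1), 0.95)

    def should_assist(self, confidence: float) -> bool:
        return confidence >= self.threshold
```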

In block 414, when participant users in a communication session permit access to user data, information items of one or more participants may be identified in response to the invocation. For example, such information items may include a user’s documents, photos, videos, calendar, location information, etc. Users may be provided with options to indicate permissions for individual information items, information item types, etc. to exclude. When the invocation is explicit, e.g., “Assistant, show my calendar,” the corresponding information items are identified based on a user-initiated command. When the invocation is implicit, e.g., “let me show you my photos,” the conversation context may be utilized to identify information items, e.g., a user’s photos from a recent trip to Lake Tahoe. In some implementations, e.g., where participant users do not permit access to user data, block 414 is not performed. The method proceeds to block 416.

In block 416, information items from one or more public sources may be identified. For example, public sources may include any type of available source, such as maps, sports schedules and scores, news websites, recipes, etc., that is identified based on the invocation. For example, if the user’s conversation context includes “restaurant A,” a map showing the location of restaurant A may be identified. In another example, if the user’s conversation context includes “Roger Federer” and “Wimbledon,” information may be retrieved, e.g., from a knowledge graph, about the number of Wimbledon titles Roger Federer has won, the most recent result at Wimbledon for Roger Federer, etc. The method proceeds to block 418.

In block 418, information items from sources that are shared between participant users, e.g., shared folders of computer files, shared photo albums, shared documents, etc., may be identified. For example, when the communication session is a meeting, a meeting agenda and one or more documents to be reviewed during the meeting may be retrieved. In some implementations, retrieval of information items from shared sources may be performed based on user accounts at a server system that provides the shared sources, e.g., a file sharing service, a photos service, etc., when users provide consent for such automatic retrieval. In some implementations, one or more shared permissions can be used. For example, the shared permissions may indicate that an assistant application may surface information item(s) that can be seen by multiple or all participants in the conversation, e.g., documents, images, videos, etc., to which all participants have access per an access control policy. In this example, prior to providing the information item(s), a check may be performed to determine which of the participants can access the information. For example, the check may be based on recognizing the participants, e.g., if the users permit, by use of facial recognition techniques, based on a user account, etc. The method proceeds to block 420.
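The shared-permission check described above might be sketched as follows, where the AccessPolicy interface is a hypothetical stand-in for whatever access control policy governs the shared source.

```python
# Sketch: only surface an item that every recognized participant may access.
from typing import Iterable, Protocol


class AccessPolicy(Protocol):
    def can_access(self, user_id: str, item_id: str) -> bool: ...


def visible_to_all(policy: AccessPolicy, item_id: str,
                   participant_ids: Iterable[str]) -> bool:
    """True only if all recognized participants may view the item."""
    return all(policy.can_access(uid, item_id) for uid in participant_ids)
```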

In block 420, it is determined whether at least one of the retrieved information items matches the conversation context or the explicit invocation. For example, if the conversation context indicates “skiing,” “Lake Tahoe,” and “photos” and if no photos are identified that match the context, it may be determined that the retrieved information items are not suitable for provision in the communication session. In another example, if the explicit invocation indicates “Document A,” and if no matching document is found, it may be determined that there are no information items that meet the threshold. If no information items are retrieved, the method proceeds to block 434. If at least one information item is retrieved, the method proceeds to block 422. In some implementations, a different criterion, e.g., “at least two information items,” may be used. In some implementations, the criterion may be based on the conversation context or the explicit invocation. For example, if the explicit invocation is for the three most recent videos, the criterion may be set as “at least three information items.” Some implementations can include multiple criteria. Some of the criteria can be stricter than other criteria. Based on the strictness, information items can be presented in different ways. For example, if it is detected that the conversation is about a particular document, e.g., a user speaks about a document with the title “My important report,” and that a document that matches the title is available in the user’s files, the document may be provided in the communication session. However, if there is no document that has an exact match to the title (e.g., one or more available documents are possible matches, e.g., have a substring that matches the title), an assistant application may provide a user interface for a conversation participant (e.g., an owner of the document) that enables the conversation participant to select a document from the one or more available documents, or to create a new document with the title.
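The strict-versus-loose matching described in the document example above could be sketched as follows; the return convention and matching rules are assumptions chosen only to illustrate the idea of criteria with differing strictness.

```python
# Sketch of block 420: exact title matches are shown directly; partial matches
# prompt the document owner to choose or create a document.
from typing import List, Tuple


def match_documents(spoken_title: str, titles: List[str]) -> Tuple[str, List[str]]:
    """Return ("show", [title]) on an exact match, ("prompt", candidates) on
    partial matches, or ("none", []) when nothing matches."""
    wanted = spoken_title.lower()
    exact = [t for t in titles if t.lower() == wanted]
    if exact:
        return "show", exact[:1]
    partial = [t for t in titles if wanted in t.lower() or t.lower() in wanted]
    if partial:
        return "prompt", partial
    return "none", []
```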

In block 434, it is determined if assistance was explicitly invoked. If assistance was explicitly invoked, but matching information items were not identified, a failure message, e.g., “unable to retrieve requested document,” may be provided, e.g., to be displayed in a user interface during the communication session. The method proceeds to block 410 where further media content received in the communication session is analyzed.

In block 422, one or more of the identified information items are provided in the communication session. In some implementations, the information item may be a user interface that indicates that the assistant application is performing a particular activity. For example, in some implementations, the assistant application may provide note-taking functionality, in response to explicit invocation, e.g., “Assistant, take notes,” or implicit invocation. In these implementations, the information item is a user interface that indicates that the assistant application is taking notes, e.g., if a participant user in the communication session speaks and lists ingredients from a recipe, the assistant application may transcribe the speech to make a list of the ingredients. In some implementations, a participant user may further invoke the assistant, e.g., to add the list of ingredients from the notes to a shopping list, and set a reminder to purchase the ingredients during a subsequent trip to a grocery store. In some implementations, the notes may be added to a meeting summary that is provided by the assistant application at the end of the communication session. The method proceeds to block 410 where further media content received in the communication session is analyzed.

FIG. 5A is a diagrammatic illustration of an example user interface 500, according to some implementations. In the example shown in FIG. 5A, users of computing devices 502 and 504 are engaged in a communication session. In the example shown in FIG. 5A, the communication session is a video call. Device 502 displays video received from device 504 in large size and video generated at device 502 in smaller size. Similarly, device 504 displays video received from device 502 in large size and video generated at device 504 in smaller size. The user of device 504 has spoken phrase 506 “Just got back from vacation! Went skiing in Tahoe!” to which the user of device 502 has responded with phrase 508 “You must have amazing pictures!”

FIG. 5B is a diagrammatic illustration of an example user interface 520, according to some implementations. Continuing the example illustrated in FIG. 5A, the user of device 504 responds to the user of device 502 with phrase 522 “Yes! Let me show you the photos!” In response to the user interaction, it is determined that assistance is invoked. A user interface 524 is displayed on device 504 with information items determined based on the context, e.g., as determined from phrases 506, 508, and 522.

User interface 524 includes photos in a photo library of the user of device 504 that match the context, e.g., photos from a recent ski trip to Lake Tahoe. Further, it is determined that the photos are private, e.g., not shared with the user of device 502. User interface 524 includes a message “Here are your ski pictures! OK to share?” that the user can select to share the photos with the user of device 502. In some implementations, a first command is sent to a first computing device, e.g., the device 504, to display a user interface 524 that includes a selectable user interface element (e.g., the text “OK to share?”) that a user of the device can select to indicate user permission to share the photos during the communication session. In response to user selection indicating that the user has provided permission to share the photos, an indication is sent to the device that sent the first command, e.g., server device 104, that the user has provided the permission to share the photos. Server device 104, or another client device that provides assistance, sends a subsequent command to display the photos in the communication session. If the user chooses to not share the photos in the communication session, the photos are not displayed in the communication session. While this example illustrates providing a user interface in the communication session with photos when permission is provided by the user, any type of information item may be shared in the communication session when users provide permissions for sharing.

In some implementations, e.g., when the photos were previously shared with the user of device 502, if the photos are publicly available, or if the user of device 504 has previously provided permission to share photos in communication sessions, the user interface 524 may not be displayed, and instead, the identified information items, e.g., photos, are automatically displayed in the communication session.

FIG. 5C is a diagrammatic illustration of an example user interface 540, according to some implementations. Continuing the example illustrated in FIG. 5B, the user of device 504 has granted permission to share photos in the communication session. As illustrated in FIG. 5C, user interface 542 that includes the photos is displayed concurrently on each of devices 502 and 504. In some implementations, either or both of the users in the communication session may control the user interface, e.g., scroll the photos, zoom into a particular photo, etc. In response to user input, e.g., to scroll the photos, the user interface on both devices is updated. As illustrated in FIG. 5C, the user of device 502 has responded with phrase 544 “Looks like you took a tumble!”

FIG. 6A is a diagrammatic illustration of an example user interface 600, according to some implementations. In the example shown in FIG. 6A, users of computing devices 602 and 604 are engaged in a communication session. In the example shown in FIG. 6A, the communication session is a video call. Device 602 displays video received from device 604 in large size and video generated at device 602 in smaller size. Similarly, device 604 displays video received from device 602 in large size and video generated at device 604 in smaller size. The user of device 602 has spoken phrase 606 “Are you still up for brunch on Sunday?”, the user of device 604 has responded with phrase 608 “Yes, where should we go to eat?” and the user of device 602 has responded with phrase 610 “Any good brunch places in Napa?” The conversation context for the communication session illustrated in FIG. 6A is determined, e.g., that the context includes “brunch on Sunday” and that a likely location is “Napa.”

FIG. 6B is a diagrammatic illustration of an example user interface 620, according to some implementations. Continuing the example illustrated in FIG. 6A, the user interface displayed on each of devices 602 and 604 is updated based on the conversation context to display information item 626 that includes restaurant options for brunch in Napa, e.g., “Restaurant A,” “Restaurant B,” and “Restaurant C.” Further, additional information that is useful in the conversation context, e.g., star ratings, is also included. The user of device 604 responds to the user of device 602 with phrase 622 “What about restaurant A? I wonder if it’s open” and the user of device 602 responds with phrase 624 “I like it!” It is determined that additional conversation context is provided by this subsequent interaction. For example, the additional context is the query “is restaurant A open?” based on phrases 622 and 624.

FIG. 6C is a diagrammatic illustration of an example user interface 640, according to some implementations. Continuing the example illustrated in FIG. 6B, the user interface displayed on each of devices 602 and 604 is updated based on the conversation context to display information item 644 that includes detailed information about restaurant A, including the hours when restaurant A is open, cuisine type, a brief description of the ambience at restaurant A, etc. The user of device 604 responds with phrase 642 “Looks like it is! Assistant, show me on a map please.” The conversation context is further updated based on phrase 642. Further, it is detected that the user of device 604 made an explicit invocation for assistance.

FIG. 6D is a diagrammatic illustration of an example user interface 660, according to some implementations. Continuing the example illustrated in FIG. 6C, the user interface displayed on each of devices 602 and 604 is updated based on the conversation context to display information item 662 that includes a map showing the location of restaurant A. In the example illustrated in FIG. 6D, it is further determined that display of information item 662 occupies a larger area and that faces of the users engaged in the communication session may be obscured due to overlay of information item 662. As illustrated in FIG. 6D, faces of the users are displayed closer to the top of the screen of each of devices 602 and 604, and the faces are reduced in size, e.g., from the full-screen display of FIGS. 6A-6C, to a partial screen display in FIG. 6D.

As illustrated in FIGS. 6A-6D, as conversation in the communication session progresses, context is updated. Based on determined context and user interaction, various information items are identified and displayed in the communication session. Further, assistance may be provided based on implicit invocation and/or explicit invocation.

While FIGS. 5 and 6 illustrate a visual user interface provided by an assistant application, in some implementations, the assistant application may provide assistance in audio or video form. For example, upon determining that the conversation in a communication session includes a user query, e.g., “distance to restaurant A,” the assistant application may provide the answer in audio form, e.g., “restaurant A is five miles away,” in addition to or alternatively to providing this information item in visual form. In some implementations, e.g., when the communication session is an audio-only session, or when one or more of the computing devices in the session are not equipped with a screen (or are configured with the screen turned off), the assistant application may provide the answer in audio format. In some implementations, users may be provided with options to indicate a preferred format, e.g., visual or audio, for information items. In some implementations, an assistant application may provide the information item on an alternative device, e.g., when the user participates in the conversation on a device without a screen or with limited screen space, the information item may be displayed on the user’s smartphone or tablet screen, e.g., on a smartphone or tablet that is linked with the user account that participates in the conversation.

In some implementations, assistant applications can be customized for participant users, e.g., based on the user’s context, the user’s voice or speech patterns, etc. For example, the assistant application may detect that the user is participating in the communication session via a home speaker device that is not equipped with a screen, and based on this determination, provide information items in audio format. In some implementations, the assistant application may determine that the user is participating in the communication session from a public location, and in response, the assistant application may provide information items in a user interface displayed on a screen and turn off audio assistance. In some implementations, the assistant application may learn over time, e.g., if users permit the assistant application to learn user preferences from user activity in communication sessions, the assistant application can customize assistance features over time.

In some implementations, the assistant application may be customized based on user preferences, e.g., a user may indicate that she prefers to receive assistance during communication sessions with specific other users, and no assistance in communication sessions with other participants. In some implementations, the user may indicate a preference that the assistant application provide assistance for specific contexts, e.g., note taking, translation, etc., and/or not provide assistance for specific contexts, e.g., conversations about photos. In these implementations, the assistant application may provide assistance based on indicated user preferences. In some implementations, when users permit use of additional context information such as location, the assistant application may be customized based on such additional context information. For example, the assistant application may be more likely to provide a suggestion to take notes based on a determination that the user is at work, a location associated with notetaking, and provide the suggestion with a lower likelihood, e.g., when the user is at a non-work location.

In some implementations, an assistant application may provide additional assistance features. For example, in response to a user command, the assistant application may record and/or transcribe spoken conversation in a communication session, e.g., to facilitate later retrieval. For example, the assistant application may be invoked in a meeting held via audio or video conferencing to take meeting notes, to display a meeting agenda, to record action items post-meeting, etc.

In some implementations, an assistant application may provide assistance features specific to a communication session. For example, the assistant application may be usable by participant users to control settings during the communication session, e.g., “mute audio,” “turn off camera,” “switch to back-facing camera,” etc. In some implementations, the assistant application may add users to a communication session, e.g., when a participant user indicates “add Dad to the call.” In some implementations, the assistant application may recognize multiple users that participate in the communication session from the same computing device, e.g., a videoconferencing system, a television set, etc., and provide assistance features based on a determination that there are multiple users present. In some implementations, the assistant application may disable user interface features, e.g., prompts to approve sharing certain photos, at the computing device. In some implementations, the assistant application may provide user interface features, e.g., prompts to approve sharing certain photos, at alternate computing devices that are associated with a user account for a user that is a participant in the conversation from a device where multiple users are recognized as participating in a communication session.

In some implementations, assistant applications may perform visual recognition, e.g., based on video exchanged during a communication session, and provide information items. For example, a participant user may provide video from a camera of their computing device. The assistant application may detect that the video includes a recognized object, e.g., a monument, a book, a media item, etc., and display a user interface that provides information or actions associated with the recognized object. For example, the user interface may provide information about a monument and permit the participant users to add the monument to personal or shared lists of destinations to visit. In another example, the user interface may provide information about a book recognized from the video, and the user interface may provide options for participant users to purchase the book, to add the book to a wishlist, etc.

In some implementations, the user interface may include information items personalized for each user. For example, when the assistant application displays a map showing restaurant A, the assistant application may include directions to the restaurant from a respective location of each participant user in the communication session. In another example, e.g., when the assistant application assists users in scheduling a meeting, the assistant application may display the meeting time in local time at the location of each participant user. In some implementations, information items may be customized based on user preferences, e.g., ingredient quantities in a recipe may be displayed in units preferred by a user, e.g., in grams or ounces.

Assistant applications as described herein may provide information items concurrently to a plurality of participant users in a communication session. Users may interact with the information items, with the interaction being mirrored on the user interfaces for other participant users. In some implementations, the assistant application may be invoked by any user in a communication session. In some implementations, the assistant application may determine conversation context based on a flow of conversation and/or gestures in media content in the communication session, e.g., the conversation context may be determined based on speech or video received from two or more users in the communication session.

FIG. 7 is a diagrammatic illustration of an example user interface 700, according to some implementations. A plurality of conversation context phrases, e.g., phrases 702, 706, and 710, are illustrated, along with corresponding information items 704, 708, and 712 that were provided in a communication session. For example, phrase 702 may be spoken before phrases 706 and 710. User interface 700 presents a scrollable summary of conversation context and information items provided during a communication session in stacked form. A user may scroll, e.g., upwards or downwards, to view earlier contexts and/or information items. In some implementations, a user can access the scrollable summary during the communication session, e.g., by selecting a user interface element. In some implementations, the scrollable summary may be provided after the communication session has been terminated. In some implementations, the scrollable summary may include one or more reminders, e.g., based on the assistant application identifying action items discussed during the communication session.

FIG. 8 is a diagrammatic illustration of an example user interface 800, according to some implementations. In the example illustrated in FIG. 8, a multi-party communication session is in progress. Users 810, 812, and 814 are participants in the communication session. User interface 800 may be shown on a computing device, e.g., a computing device of users 810, 812, and 814. Users 810, 812, and 814 are engaged in a group activity in the communication session, mediated by an assistant application (or assistant program).

In the example illustrated in FIG. 8, the group activity is a quiz. In different implementations, the group activity can be any type of activity, e.g., viewing media, editing a document, etc. It is determined that the conversation context is a quiz, and in response, video of each user is displayed in a small size in the user interface 800. User interface 800 includes information items for the quiz, e.g., an image 802 and a question 804 associated with the image. In this example, the user “Player 3” has provided an answer 806, “Machu Picchu,” to the quiz question. The assistant application may determine that Player 3 has provided the answer, and in response, update a score of Player 3 in the quiz. Further, the assistant application may bring up a next question in the quiz. In this manner, an assistant application may be invoked in a communication session to provide or enhance multi-party user interaction.

In some implementations, a visual representation of the assistant application may be included in user interface 800. In some implementations, in addition to or alternatively to providing display of user interface 800, the assistant application may also provide audio to participants of the communication session. For example, question 804 may be read aloud and provided as audio in the communication session.

In some implementations, when users provide permission, the assistant application may record video and/or audio during the quiz, and replay the video, e.g., when a participant user answers a quiz question correctly. In some implementations, the assistant application may utilize media content, e.g., video, from the communication session in the quiz, e.g., to recognize a user that raised their hand first. In some implementations, gesture recognition techniques may be used for such features. Gesture recognition may also be used for other features, e.g., to conduct a poll in a multi-participant communication session.

FIG. 9 is a block diagram of an example device 900 which may be used to implement one or more features described herein. In one example, device 900 may be used to implement a client device, e.g., any of client devices 120-126 shown in FIG. 1. Alternatively, device 900 can implement a server device, e.g., server device 104, server device 142, etc. In some implementations, device 900 may be used to implement a client device, a server device, or both client and server devices. Device 900 can be any suitable computer system, server, or other electronic or hardware device as described above.

One or more methods described herein can be run as a standalone program that can be executed on any type of computing device, as a program run in a web browser, or as a mobile application (“app”) run on a mobile computing device (e.g., cell phone, smart phone, tablet computer, wearable device (wristwatch, armband, jewelry, headwear, virtual reality goggles or glasses, augmented reality goggles or glasses, head mounted display, etc.), laptop computer, etc.).

In one example, a client/server architecture can be used, e.g., a mobile computing device (as a client device) sends user input data to a server device and receives from the server the final output data for output (e.g., for display). In another example, all computations can be performed within the mobile app (and/or other apps) on the mobile computing device. In another example, computations can be split between the mobile computing device and one or more server devices.
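
As a minimal, hypothetical sketch of the client/server split described above, a client device might post user input data to a server and receive the final output data for display. The endpoint URL, JSON fields, and function name below are illustrative assumptions, not an interface of the described system.

```python
# Hypothetical client-side sketch of the client/server split described above.
# The URL and JSON fields are illustrative assumptions, not an actual API.
import requests

def get_output_from_server(user_input: str) -> str:
    response = requests.post(
        "https://example.com/assist",      # hypothetical server endpoint
        json={"input": user_input},        # user input data sent to the server
        timeout=5,
    )
    response.raise_for_status()
    return response.json()["output"]       # final output data for display on the client
```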

In some implementations, device 900 includes a processor 902, a memory 904, and input/output (I/O) interface 906. Processor 902 can be one or more processors and/or processing circuits to execute program code and control basic operations of the device 900. A “processor” includes any suitable hardware system, mechanism or component that processes data, signals or other information. A processor may include a system with a general-purpose central processing unit (CPU) with one or more cores (e.g., in a single-core, dual-core, or multi-core configuration), multiple processing units (e.g., in a multiprocessor configuration), a graphics processing unit (GPU), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a complex programmable logic device (CPLD), dedicated circuitry for achieving functionality, a special-purpose processor to implement neural network model-based processing, neural circuits, processors optimized for matrix computations (e.g., matrix multiplication), or other systems.

In some implementations, processor 902 may include one or more co-processors that implement neural-network processing. In some implementations, processor 902 may be a processor that processes data to produce probabilistic output, e.g., the output produced by processor 902 may be imprecise or may be accurate within a range from an expected output. Processing need not be limited to a particular geographic location, or have temporal limitations. For example, a processor may perform its functions in “real-time,” “offline,” in a “batch mode,” etc. Portions of processing may be performed at different times and at different locations, by different (or the same) processing systems. A computer may be any processor in communication with a memory.

Memory 904 is typically provided in device 900 for access by the processor 902, and may be any suitable processor-readable storage medium, such as random access memory (RAM), read-only memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Flash memory, etc., suitable for storing instructions for execution by the processor, and located separate from processor 902 and/or integrated therewith. Memory 904 can store software operating on the server device 900, executed by the processor 902, including an operating system 908, machine-learning application 930, other applications 912, and application data 914. Other applications 912 may include applications such as a data display engine, web hosting engine, image display engine, notification engine, social networking engine, etc. In some implementations, the machine-learning application 930 and other applications 912 can each include instructions that enable processor 902 to perform functions described herein, e.g., some or all of the methods of FIGS. 3 and 4.

Other applications 912 can include, e.g., image editing applications, media display applications, communication applications, assistant applications, web hosting engines or applications, mapping applications, media sharing applications, etc. One or more methods disclosed herein can operate in several environments and platforms, e.g., as a stand-alone computer program that can run on any type of computing device, as a web application having web pages, as a mobile application (“app”) run on a mobile computing device, etc.

In various implementations, machine-learning application 930 may utilize Bayesian classifiers, support vector machines, neural networks, or other learning techniques. In some implementations, machine-learning application 930 may include a trained model 934, an inference engine 936, and data 932. In some implementations, data 932 may include training data, e.g., data used to generate trained model 934. For example, training data may include any type of data such as text, images, audio, video, etc. Training data may be obtained from any source, e.g., a data repository specifically marked for training, data for which permission is provided for use as training data for machine-learning, etc. In implementations where one or more users permit use of their respective user data to train a machine-learning model, e.g., trained model 934, training data may include such user data. In implementations where users permit use of their respective user data, data 932 may include permitted data such as images (e.g., photos or other user-generated images), communications (e.g., e-mail; chat data such as text messages, voice, video, etc.), documents (e.g., spreadsheets, text documents, presentations, etc.).

In some implementations, data 932 may include collected data such as map data, image data (e.g., satellite imagery, overhead imagery, etc.), game data, etc. In some implementations, training data may include synthetic data generated for the purpose of training, such as data that is not based on user input or activity in the context that is being trained, e.g., data generated from simulated conversations, computer-generated images, etc. In some implementations, machine-learning application 930 excludes data 932. For example, in these implementations, the trained model 934 may be generated, e.g., on a different device, and be provided as part of machine-learning application 930. In various implementations, the trained model 934 may be provided as a data file that includes a model structure or form, and associated weights. Inference engine 936 may read the data file for trained model 934 and implement a neural network with node connectivity, layers, and weights based on the model structure or form specified in trained model 934.

Machine-learning application 930 also includes a trained model 934. In some implementations, the trained model may include one or more model forms or structures. For example, model forms or structures can include any type of neural-network, such as a linear network, a deep neural network that implements a plurality of layers (e.g., “hidden layers” between an input layer and an output layer, with each layer being a linear network), a convolutional neural network (e.g., a network that splits or partitions input data into multiple parts or tiles, processes each tile separately using one or more neural-network layers, and aggregates the results from the processing of each tile), a sequence-to-sequence neural network (e.g., a network that takes as input sequential data, such as words in a sentence, frames in a video, etc. and produces as output a result sequence), etc.

The model form or structure may specify connectivity between various nodes and organization of nodes into layers. For example, nodes of a first layer (e.g., input layer) may receive data as input data 932 or application data 914. Such data can include, for example, one or more pixels per node, e.g., when the trained model is used for image analysis. Subsequent intermediate layers may receive as input the output of nodes of a previous layer per the connectivity specified in the model form or structure. These layers may also be referred to as hidden layers. A final layer (e.g., output layer) produces an output of the machine-learning application. For example, the output may be a set of labels for an image, a representation of the image that permits comparison of the image to other images (e.g., a feature vector for the image), an output sentence in response to an input sentence, one or more categories for the input data, etc., depending on the specific trained model. In some implementations, the model form or structure also specifies a number and/or type of nodes in each layer.
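
As a minimal, hypothetical sketch (not the specific implementation of trained model 934), the idea of a model form or structure that specifies the layers and the number of nodes in each layer can be illustrated with a small feed-forward network, here using PyTorch; the layer sizes are arbitrary.

```python
# Illustrative sketch: a model "form or structure" described as data, then
# instantiated as a simple feed-forward network. Sizes here are arbitrary.
from torch import nn

model_structure = {"input_nodes": 784, "hidden_layers": [128, 64], "output_nodes": 10}

def build_model(structure):
    sizes = [structure["input_nodes"], *structure["hidden_layers"], structure["output_nodes"]]
    layers = []
    for in_size, out_size in zip(sizes[:-1], sizes[1:]):
        layers.append(nn.Linear(in_size, out_size))  # connections (weights) between successive layers
        layers.append(nn.ReLU())                     # nonlinear activation between layers
    return nn.Sequential(*layers[:-1])               # no activation after the output layer

model = build_model(model_structure)
print(model)
```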

In different implementations, trained model 934 can include a plurality of nodes, arranged into layers per the model structure or form. In some implementations, the nodes may be computational nodes with no memory, e.g., configured to process one unit of input to produce one unit of output. Computation performed by a node may include, for example, multiplying each of a plurality of node inputs by a weight, obtaining a weighted sum, and adjusting the weighted sum with a bias or intercept value to produce the node output.

In some implementations, the computation performed by a node may also include applying a step/activation function to the adjusted weighted sum. In some implementations, the step/activation function may be a nonlinear function. In various implementations, such computation may include operations such as matrix multiplication. In some implementations, computations by the plurality of nodes may be performed in parallel, e.g., using multiple processor cores of a multicore processor, using individual processing units of a GPU, or special-purpose neural circuitry. In some implementations, nodes may include memory, e.g., may be able to store and use one or more earlier inputs in processing a subsequent input. For example, nodes with memory may include long short-term memory (LSTM) nodes. LSTM nodes may use the memory to maintain “state” that permits the node to act like a finite state machine (FSM). Models with such nodes may be useful in processing sequential data, e.g., words in a sentence or a paragraph, frames in a video, speech or other audio, etc.
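
The per-node computation described above (a weighted sum of the node inputs, adjusted by a bias, followed by a step/activation function) can be sketched as follows; the ReLU activation is one possible choice, used here purely for illustration.

```python
import numpy as np

def node_output(inputs, weights, bias):
    """One computational node: weighted sum of node inputs, plus a bias
    (intercept) value, followed by a nonlinear activation (ReLU here)."""
    weighted_sum = float(np.dot(inputs, weights)) + bias
    return max(0.0, weighted_sum)  # step/activation function applied to the adjusted weighted sum

# Example: a node with three inputs.
print(node_output(np.array([0.5, -1.0, 2.0]), np.array([0.1, 0.4, 0.3]), bias=0.05))
```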

In some implementations, trained model 934 may include embeddings or weights for individual nodes. For example, a model may be initiated as a plurality of nodes organized into layers as specified by the model form or structure. At initialization, a respective weight may be applied to a connection between each pair of nodes that are connected per the model form, e.g., nodes in successive layers of the neural network. For example, the respective weights may be randomly assigned, or initialized to default values. The model may then be trained, e.g., using data 932, to produce a result.
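
As an illustrative sketch of the initialization step (with arbitrary layer sizes), one weight can be assigned to each connection between nodes in successive layers, drawn at random here.

```python
import numpy as np

# Illustrative initialization: one weight matrix per pair of successive layers,
# so each connection between connected nodes receives a randomly assigned weight.
rng = np.random.default_rng(seed=0)
layer_sizes = [4, 8, 3]  # nodes per layer, per the model form or structure
weights = [rng.normal(size=(m, n)) for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
print([w.shape for w in weights])  # e.g., (4, 8) and (8, 3)
```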

For example, training may include applying supervised learning techniques. In supervised learning, the training data can include a plurality of inputs (e.g., a set of images) and a corresponding expected output for each input (e.g., one or more labels for each image). Based on a comparison of the output of the model with the expected output, values of the weights are automatically adjusted, e.g., in a manner that increases a probability that the model produces the expected output when provided similar input.
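
A minimal, hypothetical sketch of such a supervised training step (using PyTorch, with random stand-in data rather than real training data) shows how weight values are adjusted based on a comparison of the model output with the expected output.

```python
import torch
from torch import nn

# Stand-in training data: a batch of inputs and the expected output (label) for each.
inputs = torch.randn(32, 4)
expected = torch.randint(0, 3, (32,))

model = nn.Linear(4, 3)                                  # a minimal model for illustration
loss_fn = nn.CrossEntropyLoss()                          # compares model output with expected output
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(10):                                      # a few training iterations
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), expected)              # comparison with the expected output
    loss.backward()                                      # determine how each weight should change
    optimizer.step()                                     # automatically adjust the weight values
```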

In some implementations, training may include applying unsupervised learning techniques. In unsupervised learning, only input data may be provided and the model may be trained to differentiate data, e.g., to cluster input data into a plurality of groups, where each group includes input data that are similar in some manner. For example, the model may be trained to differentiate images such that the model distinguishes abstract images (e.g., synthetic images, human-drawn images, etc.) from natural images (e.g., photos).
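
A small, hypothetical illustration of unsupervised learning in this sense (using scikit-learn k-means on random stand-in feature vectors) groups unlabeled inputs into clusters of similar items.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in feature vectors for unlabeled inputs (e.g., images or words);
# no expected outputs are provided, which is what makes this unsupervised.
features = np.random.rand(100, 8)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(features)
print(kmeans.labels_[:10])  # the group (cluster) assigned to each of the first ten inputs
```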

In another example, a model trained using unsupervised learning may cluster words based on the use of the words in input sentences. In some implementations, unsupervised learning may be used to produce knowledge representations, e.g., that may be used by machine-learning application 930. In various implementations, a trained model includes a set of weights, or embeddings, corresponding to the model structure. In implementations where data 932 is omitted, machine-learning application 930 may include trained model 934 that is based on prior training, e.g., by a developer of the machine-learning application 930, by a third-party, etc. In some implementations, trained model 934 may include a set of weights that are fixed, e.g., downloaded from a server that provides the weights.

Machine-learning application 930 also includes an inference engine 936. Inference engine 936 is configured to apply the trained model 934 to data, such as application data 914, to provide an inference. In some implementations, inference engine 936 may include software code to be executed by processor 902. In some implementations, inference engine 936 may specify circuit configuration (e.g., for a programmable processor, for a field programmable gate array (FPGA), etc.) enabling processor 902 to apply the trained model. In some implementations, inference engine 936 may include software instructions, hardware instructions, or a combination. In some implementations, inference engine 936 may offer an application programming interface (API) that can be used by operating system 908 and/or other applications 912 to invoke inference engine 936, e.g., to apply trained model 934 to application data 914 to generate an inference.
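
A hypothetical sketch of an inference engine exposing a simple API is shown below; the class and method names are illustrative assumptions, not the actual interface of inference engine 936.

```python
import torch

class InferenceEngine:
    """Illustrative wrapper that applies a trained model to application data."""

    def __init__(self, trained_model: torch.nn.Module):
        self.model = trained_model
        self.model.eval()                              # inference only; no training

    def infer(self, application_data):
        """Apply the trained model to application data and return an inference."""
        with torch.no_grad():                          # gradients are not needed for inference
            return self.model(torch.as_tensor(application_data, dtype=torch.float32))

# Example: an invoking application calls the API to generate an inference.
engine = InferenceEngine(torch.nn.Linear(4, 2))
print(engine.infer([[0.1, 0.2, 0.3, 0.4]]))
```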

Machine-learning application 930 may provide several technical advantages. For example, when trained model 934 is generated based on unsupervised learning, trained model 934 can be applied by inference engine 936 to produce knowledge representations (e.g., numeric representations) from input data, e.g., application data 914. For example, a model trained for image analysis may produce representations of images that have a smaller data size (e.g., 1 KB) than input images (e.g., 10 MB). In some implementations, such representations may be helpful to reduce processing cost (e.g., computational cost, memory usage, etc.) to generate an output (e.g., a label, a classification, a sentence descriptive of the image, etc.). In some implementations, such representations may be provided as input to a different machine-learning application that produces output from the output of inference engine 936.
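
The size reduction can be illustrated with a toy stand-in: simple block averaging takes the place of a trained encoder here, which is an assumption made only to show that the representation is much smaller than the input, not how trained model 934 actually computes it.

```python
import numpy as np

# Toy illustration of a compact representation: a large image array is reduced
# to a far smaller summary that is cheaper to store, transmit, or process.
image = np.random.rand(1024, 1024).astype(np.float32)      # ~4 MB of input data
summary = image.reshape(32, 32, 32, 32).mean(axis=(1, 3))  # 32x32 block averages, ~4 KB
print(image.nbytes, "bytes ->", summary.nbytes, "bytes")
```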

In some implementations, knowledge representations generated by machine-learning application 930 may be provided to a different device that conducts further processing, e.g., over a network. In such implementations, providing the knowledge representations rather than the images may provide a technical benefit, e.g., enable faster data transmission with reduced cost. In another example, a model trained for clustering documents may produce document clusters from input documents. The document clusters may be suitable for further processing (e.g., determining whether a document is related to a topic, determining a classification category for the document, etc.) without the need to access the original document, and therefore save computational cost.

In some implementations, machine-learning application 930 may be implemented in an offline manner. In these implementations, trained model 934 may be generated in a first stage, and provided as part of machine-learning application 930. In some implementations, machine-learning application 930 may be implemented in an online manner. For example, in such implementations, an application that invokes machine-learning application 930 (e.g., operating system 908, one or more of other applications 912) may utilize an inference produced by machine-learning application 930, e.g., provide the inference to a user, and may generate system logs (e.g., if permitted by the user, an action taken by the user based on the inference; or if utilized as input for further processing, a result of the further processing). System logs may be produced periodically, e.g., hourly, monthly, quarterly, etc., and may be used, with user permission, to update trained model 934, e.g., to update embeddings for trained model 934.

In some implementations, machine-learning application 930 may be implemented in a manner that can adapt to a particular configuration of device 900 on which the machine-learning application 930 is executed. For example, machine-learning application 930 may determine a computational graph that utilizes available computational resources, e.g., processor 902. For example, if machine-learning application 930 is implemented as a distributed application on multiple devices, machine-learning application 930 may determine computations to be carried out on individual devices in a manner that optimizes computation. In another example, machine-learning application 930 may determine that processor 902 includes a GPU with a particular number of GPU cores (e.g., 1000) and implement the inference engine accordingly (e.g., as 1000 individual processes or threads).
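
A hypothetical sketch of this kind of adaptation (using CPU cores and a thread pool as stand-ins for GPU cores and the actual inference workload) inspects the resources that are available and sizes the worker pool accordingly.

```python
import os
from concurrent.futures import ThreadPoolExecutor

available_cores = os.cpu_count() or 1                    # resources actually available on the device

def run_inference(item):
    return item * 2                                      # placeholder for applying the trained model

# Size the pool of workers to the available resources.
with ThreadPoolExecutor(max_workers=available_cores) as pool:
    results = list(pool.map(run_inference, range(8)))
print(f"ran with up to {available_cores} workers:", results)
```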

In some implementations, machine-learning application 930 may implement an ensemble of trained models. For example, trained model 934 may include a plurality of trained models that are each applicable to the same input data. In these implementations, machine-learning application 930 may choose a particular trained model, e.g., based on available computational resources, success rate with prior inferences, etc. In some implementations, machine-learning application 930 may execute inference engine 936 such that a plurality of trained models is applied. In these implementations, machine-learning application 930 may combine outputs from applying individual models, e.g., using a voting technique that scores individual outputs from applying each trained model, or by choosing one or more particular outputs. Further, in these implementations, machine-learning application 930 may apply a time threshold for applying individual trained models (e.g., 0.5 ms) and utilize only those individual outputs that are available within the time threshold. Outputs that are not received within the time threshold may not be utilized, e.g., discarded. For example, such approaches may be suitable when there is a time limit specified while invoking the machine-learning application, e.g., by operating system 908 or one or more applications 912.
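
An illustrative sketch of such an ensemble (with trivial stand-in models, a timeout in seconds rather than the 0.5 ms mentioned above, and a simple majority vote) could look like the following.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, wait

# Stand-in trained models; each is applicable to the same input data.
def model_a(x): return "cat"
def model_b(x): return "cat"
def model_c(x): return "dog"

TIME_THRESHOLD = 0.5  # seconds here, purely for illustration

def ensemble_inference(x, models=(model_a, model_b, model_c)):
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(m, x) for m in models]
        done, not_done = wait(futures, timeout=TIME_THRESHOLD)
        for f in not_done:
            f.cancel()                                   # outputs not available in time are discarded
        outputs = [f.result() for f in done]
    # Combine the individual outputs with a simple voting technique.
    return Counter(outputs).most_common(1)[0][0] if outputs else None

print(ensemble_inference("input"))
```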

In different implementations, machine-learning application 930 can produce different types of outputs. For example, machine-learning application 930 can provide representations or clusters (e.g., numeric representations of input data), labels (e.g., for input data that includes images, documents, etc.), phrases or sentences (e.g., descriptive of an image or video, suitable for use as a response to an input sentence, suitable for use to determine context during a conversation, etc.), images (e.g., generated by the machine-learning application in response to input), or audio or video (e.g., in response to an input video, machine-learning application 930 may produce an output video with a particular effect applied, e.g., rendered in a comic-book or particular artist’s style, when trained model 934 is trained using training data from the comic book or particular artist). In some implementations, machine-learning application 930 may produce an output based on a format specified by an invoking application, e.g., operating system 908 or one or more applications 912. In some implementations, an invoking application may be another machine-learning application. For example, such configurations may be used in generative adversarial networks, where an invoking machine-learning application is trained using output from machine-learning application 930 and vice-versa.

Any of the software in memory 904 can alternatively be stored on any other suitable storage location or computer-readable medium. In addition, memory 904 (and/or other connected storage device(s)) can store one or more messages, one or more taxonomies, electronic encyclopedia, dictionaries, thesauruses, knowledge bases, message data, grammars, user preferences, and/or other instructions and data used in the features described herein. Memory 904 and any other type of storage (magnetic disk, optical disk, magnetic tape, or other tangible media) can be considered “storage” or “storage devices.”

I/O interface 906 can provide functions to enable interfacing the server device 900 with other systems and devices. Interfaced devices can be included as part of the device 900 or can be separate and communicate with the device 900. For example, network communication devices, storage devices (e.g., memory and/or database 106), and input/output devices can communicate via I/O interface 906. In some implementations, the I/O interface can connect to interface devices such as input devices (keyboard, pointing device, touchscreen, microphone, camera, scanner, sensors, etc.) and/or output devices (display devices, speaker devices, printers, motors, etc.).

Some examples of interfaced devices that can connect to I/O interface 906 can include one or more display devices 920 that can be used to display content, e.g., images, video, and/or a user interface of an output application as described herein. Display device 920 can be connected to device 900 via local connections (e.g., display bus) and/or via networked connections and can be any suitable display device. Display device 920 can include any suitable display device such as an LCD, LED, or plasma display screen, CRT, television, monitor, touchscreen, 3-D display screen, or other visual display device. For example, display device 920 can be a flat display screen provided on a mobile device, multiple display screens provided in a goggles or headset device, or a monitor screen for a computer device.

The I/O interface 906 can interface to other input and output devices. Some examples include one or more cameras which can capture images. Some implementations can provide a microphone for capturing sound (e.g., as a part of captured images, voice commands, etc.), audio speaker devices for outputting sound, or other input and output devices.

For ease of illustration, FIG. 9 shows one block for each of processor 902, memory 904, I/O interface 906, and software blocks 908, 912, and 930. These blocks may represent one or more processors or processing circuitries, operating systems, memories, I/O interfaces, applications, and/or software modules. In other implementations, device 900 may not have all of the components shown and/or may have other elements including other types of elements instead of, or in addition to, those shown herein. While some components are described as performing blocks and operations as described in some implementations herein, any suitable component or combination of components of environment 100, device 900, similar systems, or any suitable processor or processors associated with such a system, may perform the blocks and operations described.

Methods described herein can be implemented by computer program instructions or code, which can be executed on a computer. For example, the code can be implemented by one or more digital processors (e.g., microprocessors or other processing circuitry) and can be stored on a computer program product including a non-transitory computer readable medium (e.g., storage medium), such as a magnetic, optical, electromagnetic, or semiconductor storage medium, including semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), flash memory, a rigid magnetic disk, an optical disk, a solid-state memory drive, etc. The program instructions can also be contained in, and provided as, an electronic signal, for example in the form of software as a service (SaaS) delivered from a server (e.g., a distributed system and/or a cloud computing system). Alternatively, one or more methods can be implemented in hardware (logic gates, etc.), or in a combination of hardware and software. Example hardware can be programmable processors (e.g., Field-Programmable Gate Array (FPGA), Complex Programmable Logic Device), general purpose processors, graphics processors, Application Specific Integrated Circuits (ASICs), and the like. One or more methods can be performed as part of, or as a component of, an application running on the system, or as an application or software running in conjunction with other applications and an operating system.

Although the description has been described with respect to particular implementations thereof, these particular implementations are merely illustrative, and not restrictive. Concepts illustrated in the examples may be applied to other examples and implementations.

Note that the functional blocks, operations, features, methods, devices, and systems described in the present disclosure may be integrated or divided into different combinations of systems, devices, and functional blocks as would be known to those skilled in the art. Any suitable programming language and programming techniques may be used to implement the routines of particular implementations. Different programming techniques may be employed, e.g., procedural or object-oriented. The routines may execute on a single processing device or multiple processors. Although the steps, operations, or computations may be presented in a specific order, the order may be changed in different particular implementations. In some implementations, multiple steps or operations shown as sequential in this specification may be performed at the same time.

1. A computer-implemented method comprising: determining a user preference associated with a user for a particular virtual assistant from a set of assistants; receiving, during a video communication session between a first computing device associated with the user and a second computing device, first session content from the video communication session; detecting that the first session content includes a request for media; selecting the particular virtual assistant based on the user preference; and sending, by the particular virtual assistant, a first command to at least one of the first computing device or the second computing device to display the media.

2. The method of claim 1, further comprising: requesting, by the particular virtual assistant, output from a second virtual assistant of the set of assistants, wherein the second virtual assistant provides a different service than the particular virtual assistant is operable to provide.

3. The method of claim 2, wherein the different service is translation of text-to-speech to provide speech output in a target language that is understood by the user.

4. The method of claim 1, wherein determining the user preference associated with the user is based on the user explicitly providing the user preference for the particular virtual assistant.

5. The method of claim 1, wherein determining the user preference associated with the user is based on at least one selected from the group of user feedback, the user performing an action based on the particular virtual assistant sending the first command, the user choosing an option from the particular virtual assistant that is not offered by other virtual assistants in the set of assistants, the user providing an indication of user satisfaction, and combinations thereof.

6. The method of claim 1, wherein the video communication session includes video that includes a face, and wherein the first command causes display of the media such that the face is not obscured.

7. The method of claim 1, wherein detecting that the first session content includes the request for media is based on determining from conversation context of the video communication session that an implicit invocation of the particular virtual assistant is associated with a confidence score that exceeds a score threshold.
8. A computing device comprising: a processor; and a memory coupled to the processor, with instructions stored thereon that, when executed by the processor, cause the processor to perform operations comprising: determining a user preference associated with a user for a particular virtual assistant from a set of assistants; receiving, during a video communication session between a first computing device associated with the user and a second computing device, first session content from the video communication session; detecting that the first session content includes a request for media; selecting the particular virtual assistant based on the user preference; and sending, by the particular virtual assistant, a first command to at least one of the first computing device or the second computing device to display the media.

9. The computing device of claim 8, wherein the operations further comprise: requesting, by the particular virtual assistant, output from a second virtual assistant of the set of assistants, wherein the second virtual assistant provides a different service than the particular virtual assistant is operable to provide.

10. The computing device of claim 9, wherein the different service is translation of text-to-speech to provide speech output in a target language that is understood by the user.

11. The computing device of claim 8, wherein determining the user preference associated with the user is based on the user explicitly providing the user preference for the particular virtual assistant.

12. The computing device of claim 8, wherein determining the user preference associated with the user is based on at least one selected from the group of user feedback, the user performing an action based on the particular virtual assistant sending the first command, the user choosing an option from the particular virtual assistant that is not offered by other virtual assistants in the set of assistants, the user providing an indication of user satisfaction, and combinations thereof.

13. The computing device of claim 8, wherein the video communication session includes video that includes a face, and wherein the first command causes display of the media such that the face is not obscured.

14. The computing device of claim 8, wherein detecting that the first session content includes the request for media is based on determining from conversation context of the video communication session that an implicit invocation of the particular virtual assistant is associated with a confidence score that exceeds a score threshold.
15. A non-transitory computer-readable medium with instructions stored thereon that, when executed by one or more computers, cause the one or more computers to perform operations, the operations comprising: determining a user preference associated with a user for a particular virtual assistant from a set of assistants; receiving, during a video communication session between a first computing device associated with the user and a second computing device, first session content from the video communication session; detecting that the first session content includes a request for media; selecting the particular virtual assistant based on the user preference; and sending, by the particular virtual assistant, a first command to at least one of the first computing device or the second computing device to display the media.

16. The non-transitory computer-readable medium of claim 15, wherein the operations further comprise: requesting, by the particular virtual assistant, output from a second virtual assistant of the set of assistants, wherein the second virtual assistant provides a different service than the particular virtual assistant is operable to provide.

17. The non-transitory computer-readable medium of claim 16, wherein the different service is translation of text-to-speech to provide speech output in a target language that is understood by the user.

18. The non-transitory computer-readable medium of claim 15, wherein determining the user preference associated with the user is based on the user explicitly providing the user preference for the particular virtual assistant.

19. The non-transitory computer-readable medium of claim 15, wherein determining the user preference associated with the user is based on at least one selected from the group of user feedback, the user performing an action based on the particular virtual assistant sending the first command, the user choosing an option from the particular virtual assistant that is not offered by other virtual assistants in the set of assistants, the user providing an indication of user satisfaction, and combinations thereof.

20. The non-transitory computer-readable medium of claim 15, wherein the video communication session includes video that includes a face, and wherein the first command causes display of the media such that the face is not obscured.